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Preface 


Learning and memory processes are basic features of human existence 
and are also reported in other species (Clayton & Dickinson, 1998). They 
allow us to (un)consciously adapt to changes in our social and physical 
environment in a variety of ways and may have been a precursor for sur- 
vival in human evolution. Although learning and memory processes have 
been at the center of psychological, linguistic and philosophical research, 
and discussed from the earliest existence of these disciplines, there is still 
much to learn. 

In the domain of speech production and perception, the focus of the 
present book, there has been a renaissance in terms of the subject’s matter. 
Three major topics will be addressed in this book through reviewing pre- 
vious work; discovering research gaps and summarizing potential future 
research directions; or with original work. These three major topics are: 1.) 
the role of real-time sensory (auditory) feedback for learning, 2.) the role 
of motor aspects for learning and memory (including recent technological 
developments which may support learning in people with specific needs) 
and 3.) representations in memory and the role of sleep on memory con- 
solidation with a specific focus on second language learning. 

Regarding the first topic, computational and technological developments 
in recent years have made it possible to alter sensory feedback of a speaker 
in real-time. That is, a speaker’s spoken language can be recorded, manip- 
ulated, and played back with such a short delay that the speaker considers 
it as his/her own speech. These developments made it possible to investi- 
gate the role of auditory feedback in speech production and learning and 
determine how and when speakers adapt to changes in auditory feedback. 
The first two chapters of the book focus on this topic. Tiphaine Caudrelier 
and Amélie Rochet-Capellan provide a review of two decades of research 
initiated by Houde & Jordan’s (1998) pioneering study on auditory-motor 
learning in response to formant perturbations. The chapter starts with 
an overview of the impact of Houde and Jordan’s work across different 
research fields. Then, based on 77 studies using formant perturbations, 
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the authors present the systems and procedures associated with this par- 
adigm. They also provide a comprehensive review of the research topics 
addressed by these studies and their main results. The chapter concludes 
with suggestions for future research, including using sensorimotor learning 
to further explore the nature of speech production representations. 

The second chapter presents a recent study on real-time feedback per- 
turbation by Eugen Klein, Jana Brunner, and Phil Hoole. They work on 
inter- and intra-individual variability of adaptation processes during audi- 
tory perturbation of vowel formants. Specifically, the authors investigate 
the influence of experimental task demands — such as the alternating per- 
turbation of the second formant and the consonantal context of the per- 
turbed vowel - on speakers’ compensatory adjustments. Examining the 
adaptation process with due regard to its temporal dimension, the authors 
show that its variability is strongly associated with speakers’ exploratory 
behavior and cannot be exclusively ascribed to the characteristics of 
speakers’ internal models of speech motor control. 

The next three chapters deal with the particular role of motor aspects 
in learning and memory consolidation. Different populations are inves- 
tigated. In a review of literature on speech sound errors in people with 
autism, Louise McKeever, Joanne Cleland and Jonathan Delafield-Butt 
begin to explore the underlying causes of speech production differences 
in people with autism. Two major theoretical accounts for the prevalence 
of speech sound errors are highlighted: the speech attunement framework, 
and deficits in speech motor control. Both theories provide explanations 
for how children with autism may come to have difficulty learning to pro- 
duce speech which is in line with their typically developing peers. The 
chapter concludes by suggesting that both the speech attunement frame- 
work and the theory of impaired speech motor control may be complimen- 
tary, rather than competing, theories and suggests further empirical work 
to test this assertion. 

In the following chapter Joanne Cleland and James Scobbie focus on 
learning new speech motor plans in children with speech sound disorders. 
They first describe the concept of categorising persistent speech sound 
disorder in children as a disorder characterised by erroneous motor 
plans. They then go on to explain how various different forms of articu- 
latory visual biofeedback (namely, electropalatography, electromagnetic 
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articulography and ultrasound tongue imaging) can be used to allow chil- 
dren to view their articulators moving in real time and use this information 
to establish more accurate motor plans. A novel theoretical account of 
how these articulatory biofeedback techniques might lead to establishment 
of new motor plans is given. The chapter concludes with an illustrative 
case study of a child with persistent velar fronting who acquired a new 
motor plan for velar stops using ultrasound visual biofeedback. 

The role of motor aspects is then discussed by Marion Dohen, who 
provides a comprehensive review of the role of manual gestures in word 
learning and memorizing. Typically developing children as well as chil- 
dren with specific needs are the focus of this review. The findings from 
a selection of empirical studies serves for answering general questions 
about potential advantages, efficiency and types of manual gestures in 
learning novel words. Motor aspects of manual gestures, i.e. producing 
an additional gesture during learning is compared with findings where 
manual gestures are perceived only. Finally, Marion Dohen discusses three 
potential explanations as to why manual gestures might enhance learning 
novel words. 

The last two chapters are dedicated to learning new languages and how 
this information is consolidated in memory. Starting in the early 90s of 
the last century there have been several theoretical and empirical attempts 
to justify that it is not abstract linguistic representations, but rather epi- 
sodic traces (exemplars) that are stored in memory (see Smith 2015 for 
review). These may include fine phonetic detail, for example detail about 
the speaker’s voice, the communicative situation and so on. More recently, 
these approaches have been unified to hybrid models, since neither the one 
nor the other can solely account for learning a language. Lisa Morano, 
Louis ten Bosch, and Mirjam Ernestus follow then with their work on dif- 
ferent mental representations stored in memory. They place themselves in 
the Complementary Learning Systems framework which assumes the use 
of abstract and exemplar mental representations in speech comprehen- 
sion. Specifically, their work focuses on second language (L2) listeners’ 
exemplar representations of words. Their particular novel finding is that 
L2 exemplars are faithful representations of the speech signal. Thus, un- 
like abstract representations, exemplars have not been altered by listeners’ 
L1 phonological filter. The authors also found significant evidence that L2 
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listeners used abstract representations in an experiment that had been spe- 
cifically designed to trigger exemplar effects. 

In the last twenty years, evidence has been accumulated that sleep may 
play a major role for memory. Pamela Fuhrmeister reviews some recent 
work suggesting that memory consolidation during sleep is important for 
non-native speech sound learning. While factors that influence learning 
of difficult speech sound contrasts have received a lot of attention in the 
literature, less is known about how what happens after learning can af- 
fect consolidation and retention of newly learned phonetic information. 
Studies from other domains, such as motor learning, are reviewed, and 
these suggest that certain tasks that follow training can interfere with 
consolidation of new information. Hints of these effects can already be 
found in studies in the speech domain, and the author argues that the 
sleep and memory consolidation literature should inform future speech 
research and that future research should consider not only how speech 
sounds are best learned, but also how they are most optimally consoli- 
dated and retained. 

Besides the intellectual merit of the authors and reviewers, this book 
was only possible thanks to the financial support the French-German 
University (UFA) in Saarbriicken. They supported the publication of the 
book and the international winter school on “Speech production and per- 
ception: Learning and memory” that took place 2017 in Chorin, Germany. 
Furthermore, parts of the presented work were supported by a grant from 
the ANR-DFG to the Salammbo project (FU791/8-1). Finally, we would 
like to acknowledge all reviewers who helped improving clarity and 
structure of the chapters. We are very delighted that the Crowdfunding 
Initiative “Knowledge Unlatched” selected our book and allowed it to be 
open access. Thanks to all the unknown contributors! 
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Changes in speech production in response to 
formant perturbations: An overview of two 
decades of research 


Abstract: One way to investigate speech motor learning is to create artificial 
adaptation situations by perturbing speakers’ auditory feedback in real time. 
Formant perturbations were introduced by Houde and Jordan (1998), providing 
the first evidence that speakers adapt their pronunciation to compensate for these 
perturbations. Twenty years later, this chapter provides an overview of the general 
impact of Houde and Jordan’s work in speech research and beyond, as well as a 
more detailed review of studies that involve formant perturbations. The impact of 
Houde and Jordan’s work appears to be cross-disciplinary. Although mainly related 
to speech production and perception, it has also been cited in the limb movement 
and even animal research, mainly as evidence of adaptive sensorimotor control. 
Formant perturbations research has expanded rapidly since 2006, spreading across 
the world and many research teams. We identified 77 experimental studies focused 
on formant perturbations which we then analyzed with regard to technical and the- 
oretical issues. This analysis showed that various apparatuses and procedures were 
used to address important topics of speech research. A primary interest has been in 
feedback and feedforward control mechanisms in speech. These mechanisms were 
addressed in different populations, including adults and children with typical vs. 
atypical development, with behavioral or neurophysiological approaches, or both. 
Some formant perturbations studies more specifically focused on the integration of 
auditory and somatosensory feedback in speech production, while others explored 
the interaction between speech production and perception of phonemic contrasts. 
Some research questioned the processes and the nature of speech representations by 
investigating generalization of adaptation to formant perturbations. Finally, a few 
studies were interested in the effect of extraneous variables such as surface effects 
or speakers’ general cognitive abilities. Altogether, these studies provide insights 
into speech motor control in general and into the understanding of sensorimotor 
interactions in particular. The field has developed recently and may still expand in 
the future, as it allows us to address fundamental topics in speech research such 
as perception-production links or abstract vs. exemplar representations. Future 
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research with formant perturbations may also further connect sensorimotor adap- 
tation to linguistic and cognitive factors and in particular to working and long-term 
memory. 


Keywords: perturbation, real-time auditory feedback, formants, speech units, 
learning 


1. Introduction 


As an “extraordinary feat of motor control” (Kelso, Tuller, Vatikiotis- 
Bateson, & Fowler, 1984, p. 812), speech production is a challenging 
research topic, highly influenced by movement sciences (Grimme, Fuchs, 
Perrier, & Schöner, 2011; Maas et al., 2008). Speech motor control indeed 
shares numerous features with other sensorimotor systems and in partic- 
ular with limb motor control. Among these features, sensorimotor adapt- 
ability of speech is of particular interest to speech science as the basis 
of speech rehabilitation (Maas et al., 2008), and since it is ubiquitous in 
daily life. Common examples include, among others, changes in the way 
we speak according to our interlocutor or to the surroundings, such as 
speaking louder when talking with someone with a hearing impairment or 
in a noisy environment (Garnier, Henrich, & Dubois, 2010); or spontane- 
ously imitating our interlocutor’s speech sounds (Pardo, 2006). Speech 
motor control also adapts throughout the lifespan to natural or accidental 
alterations of our sensory systems or vocal tract geometry, temporarily 
or more permanently (Jones & Munhall, 2003; Lane et al., 2007). These 
adaptations allow maintenance of some level of intelligibility despite vocal 
tract growth, hearing loss, orofacial surgery, or when wearing a dental 
apparatus, losing teeth, speaking while eating etc. Being essential to speech 
production, sensorimotor adaptation of speech is the topic of numerous 
studies. For the purpose of this chapter, we will focus on studies that 
involved specific perturbation of formants. Formants are frequencies cor- 
responding to peaks of acoustical energy, the relative values of which char- 
acterize vowels. Research in this field, and especially Houde and Jordan’s 
work, was inspired by the study of visuomotor adaptation in the limb 
movement literature (Houde & Jordan, 1998). 

Pioneering work on adaptation of different visuomotor activities 
appeared at the end of the 19th century (Held, 1965; Stratton, 1897). This 
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work introduced a now common approach to assessing visuomotor adap- 
tation that consists of investigating changes in movement in response to a 
systematic distortion of visual feedback, such as prism adaptation. As an 
illustration, Stratton (1897) reported his own and extreme everyday life 
experience while wearing an apparatus for eight days that reversed the ret- 
inal image upside down and left to right. On the first day, “the entire scene 
appeared upside down”. He felt nauseous. His movements were “labo- 
rious”, “embarrassed”, “inappropriate” (p. 344), required a lot of atten- 
tion and were “extremely fatiguing” (p. 344). By the start of the third day 
things were much better, with no sign of “nervous distress” (p. 349). At 
the end of the fourth day, he “preferred to keep the glasses on rather than 
sit blindfolded” (p. 351/352). When the apparatus was removed on day 
eight, it took him some time to go back to normal feelings and motions. 
Later work on visuomotor adaptation focused on more specific activi- 
ties, less dramatic and more local and short-term changes, with a focus on 
reaching movements performed with rotations of the visual field. In this 
context, it has been repetitively demonstrated that when movements are 
achieved while the visual field is shifted by a specific angle (a), participants 
first miss the target by the same angle a. However, with repetition, they 
progressively learn to adapt their movements to the new feedback and 
reach the target accurately again. When they return to normal vision, 
after-effects and transfer effects are observed: participants miss the training 
target (after-effects) and/or a new target (transfer) by an angle more or 
less close to —æ. These effects vary as a function of the angular distance 
between the training and the testing targets (Krakauer, Pine, Ghilardi, & 
Ghez, 2000; Shadmehr & Mussa-Ivaldi, 1994). Sensorimotor adaptation 
has been attributed early on to feedforward control (i.e. predictive control 
based on learnt sensorimotor mappings) in contrast to forward closed- 
loop control (i.e. online processing of sensory inputs), visible in correction 
to unexpected perturbations (Golfinopoulos, Tourville, & Guenther, 2010; 
Houde & Chang, 2015). These notions are defined later in this chapter. 
Twenty years ago, Houde and Jordan (1998) introduced an analogous 
procedure of visuomotor rotation adaptation to question feedforward 
control in speech, which used real-time alterations of formant frequencies 
in vowels. By altering the frequencies of the first and/or second formants 
(F1 and F2 respectively) it is possible to make a vowel sound like another 
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vowel. For example, by decreasing F1 and increasing F2, the vowel /e/ 
would sound closer to the vowel //, as illustrated in Figure 1. This alter- 
ation displaces the auditory feedback, in the same way as prism vision 
displaces the visual position of the target. For example, the speaker says 
“head”, speaking into a microphone and wearing headphones (Figure 1.A). 
The signal is processed in real time so that F1 and F2 formants are moved 
towards “hid” (Figure 1.B), and played-back into the headphones. The 
consequence for the speaker is a discrepancy between the auditory target 
expected from the planned movements (“head”) and the auditory target 
they actually got (~“hid”). In other words, similar to visuomotor adap- 
tation, the speaker first misses the auditory target (Figure 1.C, “Training 
start”). With practice — repetition of shifted utterance(s) with the same per- 
turbation — the speaker adapts to the perturbation (Figure 1.C, “Training 
end”): To reach the auditory target “head” again in the presence of the 
perturbation, they produce formants in the opposite direction to the per- 
turbation. In our example, this corresponds to the production of an utter- 
ance closer to “had”. When the feedback is returned to normal or masked 
with a noise, for the same vs. different utterance(s) than the training one(s), 
after-effects vs. transfer effects are observed (Figure 1.C, column “After- 
effect” and “Transfer”). This suggests that the compensation is not only an 
online feedback control change but also affects auditory-motor mappings 
supporting feedforward control, in a more or less utterance or segment- 
specific way. The procedure was later adapted to address feedback control 
by investigating online compensation to unexpected perturbations (Purcell 
& Munhall, 2006b). 

Adaptation to formant perturbations has been investigated per se, or 
used as a paradigm to address more general issues in speech science. The 
current chapter reviews research in formant perturbations by analyzing 
Houde and Jordan’s seminal study (Houde & Jordan, 1998, 2002) and the 
scientific literature that has referred to it. Using this approach (detailed in 
the first section of the chapter) we can see the cross-disciplinary impact of 
Houde and Jordan’s work and in particular, identify the main topics of the 
scientific literature that have cited this work (reported in the second part of 
the chapter). Among the collected papers, only a subsection corresponded 
to empirical studies involving formant perturbations. Based on the analysis 
of these studies, including review of their reference lists, the latter parts 
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Figure 1: The auditory prism adaptation. (A) The speaker speaks into a 
microphone; his feedback is altered such as when he produces “head” he is 
hearing a signal closer to “hid”; (B) To do so, F1 and F2 are changed in real time; 
(C) Before the introduction of the perturbation (Baseline) the auditory feedback 
is consistent with the target. The first exposure to the perturbation (Training 
start) induces a discrepancy (or an error) between the auditory feedback and the 
planed target. With repetitive exposure to the perturbation, the talker changes 
his production to compensate for the perturbation (Training end). When the 
perturbation is removed after-effects and/or transfer effects are observed. 


of the chapter provide: (1) a description of the main apparatuses and 
paradigms used in formant perturbations studies; (2) an overview of the 
research topics addressed using these perturbations and the main reported 
results; and (3) some perspectives for future research. 


2. Paper collection and analysis 


As we were interested in the impact of Houde and Jordan’s work and also 
wanted to provide an analytical review of formant perturbations studies, 
we first analyzed the published work that referred to Houde and Jordan 
(1998 and/or 2002) from 1999 to 2018 (last update on July 6th 2018). 
This was performed using the “Cited by” function in Google Scholar. 
We choose this approach rather than keyword research, as we wanted 
to collect various sorts of publications, and because it appeared to be the 
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Table 1: Number of references in each category of the first level of selection (see 
text for details) 


Formant shift No formant shift Notin Error Total 
Rejected Kept Rejected Kept English ref. 
26 72 (+2, Houde & 140 287 35 22 584 


Jordan 1998 and 2002) 


most systematic way to collect publications in the field. To compensate for 
potential errors and omissions by Google Scholar, the results were then 
analyzed very closely. 

An analysis by year of Google Scholar output resulted in a total of 584 
references (including the two papers by Houde and Jordan, see Table 1). 
As a first step, we excluded documents that were not written in English 
or that corresponded to reference errors (57 in total, see Table 1). Among 
the 527 remaining references, we distinguished between those without vs. 
with an empirical study that included formant perturbations. In the former 
category (n=427, without formant perturbation), we kept only journal pa- 
pers for a thematic analysis of Houde and Jordan’s broad impact (n=287). 
In the latter category (n=100, with formant perturbations), we first kept 
all the documents except PhD or Master theses, posters or abstracts to 
conferences (74 references kept, 26 rejected). Note that there were 11 
PhD theses; most of them were associated with journal publications. For 
consistency in criteria, we did not include Frank (2011)’s PhD thesis, 
even though it is often cited by studies investigating linguistic effects on 
formants adaptation. Its results were never published in peer-reviewed 
papers. 

Three more papers were added that included formant perturbations. 
One paper that did not cite Houde and Jordan was found in the refer- 
ence list of the selected papers (Niziolek & Guenther, 2013); and two pa- 
pers in course of publication at the time of writing that we were aware of 
(Caudrelier, Perrier, Schwartz, & Rochet-Capellan, 2018; Klein, Brunner, 
& Hoole, in this book). The general characteristics of the documents 
including formants perturbations are described in Table 2. Technical 
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Table 2: Number of papers considered for the analysis of formant perturbations 
according to source and type. Houde & Jordan (1998, 2002) are included. 


Journal papers Proceedings papers Reports/ chapters Total 
Google Scholar 55 17 2 74 
Other sources 1 1 1 3 


papers as well as papers investigating compensation to unexpected for- 
mant perturbations were included. 

The full list of analyzed papers related to formant perturbation is avail- 
able in Table 4, with their main related research topic indicated. As the 
paper collection is based mainly on the “cited by” function of Google 
Scholar some papers may be missing despite our careful attention. 
However, we believe our analysis provides an accurate picture of the field 
at the time it was run. 


3. Overall impact of Houde and Jordan’s seminal work 


The overall impact of Houde and Jordan (1998, 2002) is illustrated in 
Figure 2. We distinguished seven broad categories of research: (1) for- 
mant perturbations studies (n=77); (2) studies that investigated speech 
compensation and/or adaptation to other auditory perturbations or equiv- 
alent situations (n=91) or (3) to an alteration of the vocal tract (n=16); 
(4) empirical or theoretical papers on speech production (n=61) or (5) on 
speech perception (n=46); (6) studies involving non-speech actions (n=25); 
and (7) experimental or theoretical papers involving animals (n=43). Five 
papers were not considered, as they were difficult to classify in these cate- 
gories. We first analyzed the journal papers that did not empirically test for- 
mant perturbations. As described above, this involved 286 articles. Broad 
research topics were identified mainly from abstract reading. A subset of 
papers was selected and read in more detail to illustrate the different topics. 
The articles on formant perturbations will be reviewed in detail in the next 
sections. We will now briefly overview the research topics in the six other 
categories. References in the following section are illustrative. 
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3.1. Compensation/adaptation of speech production 
to various auditory perturbations 


Speech compensation and adaptation were investigated prior to the devel- 
opment of formant perturbation studies and used various methods. These 
methods continued to be used in some of the later work that cited Houde 
and Jordan. About half of the papers in this first category investigated 
speech modifications in reaction to either an unexpected or a predictable 
modification of FO in different populations and conditions. A number of 
papers in this topic were published by Jones et al. (Jones & Munhall, 
2000); Larson et al. (Burnett & Larson, 2002); or Hanjun et al. (Li et al., 
2016). The other half of the studies investigated speech modifications in 
reaction to other types of auditory perturbations such as delayed auditory 
feedback (Chon, Kraft, Zhang, Loucks, & Ambrose, 2013); changes in 
intensity or noise level (Maas, Mailend, & Guenther, 2015); hearing loss 
(Palethorpe, Watson, & Barker, 2003); real or simulated use of cochlear 
implants (Casserly, 2015; Lane et al., 2007); or replacement of the audi- 
tory feedback by a stranger’s voice (Hubl et al., 2014). Other work 
modified consonant features such as frication (Shiller, Sato, Gracco, & 
Baum, 2009) or voicing (Mitsuya, MacDonald, & Munhall, 2014). Self- 
regulation in adaptation to formant perturbations was also linked with 
interpersonal auditory-motor regularizations in speech such as phonetic 
convergence (Pardo, 2006). 


3.2. Compensation/adaptation of speech production to 
perturbations of the vocal tract dynamics or geometry 


Research on compensation and adaptation to perturbations affecting the 
somatosensory feedback is another field closely connected to adaptation 
to formant perturbations. Houde and Jordan’s work was thus cited by 
studies involving an alteration of the vocal tract geometry or dynamics. 
This includes dental prostheses (Jones & Munhall, 2003); lip tubes in chil- 
dren and adults (Ménard, Perrier, & Aubin, 2016); false palates (Thibeault, 
Ménard, Baum, Richard, & McFarland, 2011); mechanical forces applied 
to the jaw with a robot (Tremblay, Shiller, & Ostry, 2003); or more per- 
manent changes such as those induced by oropharyngeal cancer treatments 
(de Bruijn et al. 2012). 
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Figure 2: Overall impact: number of analyzed papers by year and categories. 


3.3. Empirical or theoretical papers on speech production 


Houde and Jordan’s work is cited by empirical and theoretical research 
on speech production. For example, adaptation to formant perturbations 
is mentioned by studies providing further evidence of the role of auditory 
feedback in speech motor control, such as work linking auditory acuity 
to the production of speech contrasts (Perkell et al., 2004); auditory per- 
ceptual learning with improvement in production (Shiller, Rvachew, & 
Brosseau-Lapré, 2010); comparing overt and covert speech (Brumberg 
et al., 2016) or analyzing the neurophysiological activities of the auditory 
cortex during speech production (Curio, Neuloh, Numminen, Jousmaki, 
& Hari, 2000). Adaptation to formant perturbations provides support for 
neurocomputational models of speech production such as the Directions 
Into Velocity of Articulators model (DIVA, Golfinopoulos et al., 2010) or 
the State Feedback Control model (SFC, Houde & Chang, 2015), both 
models assuming a feedback and a feedforward control mechanism. 
Further information about these control mechanisms will be provided in 
the section describing formant perturbation studies related to this topic. 
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3.4. Empirical or theoretical papers on speech perception 


Adaptation to formant perturbations is also taken as evidence of sensori- 
motor integration in speech. As such, it is relevant for papers probing or 
discussing the role of the motor system in speech perception (Sato, Troille, 
Ménard, Cathiard, & Gracco, 2013) or in theoretical papers related to the 
dual-stream model of language processing. Basically, this model proposes 
a cortical ventral stream that maps speech sounds to concepts, and a dorsal 
stream for auditory-motor mapping. Adaptation to formant perturbations 
is then cited as an evidence that a dorsal auditory-motor integration path 
is still functional in adulthood (Hickok & Poeppel, 2004). 


3.5. Non-speech movement studies 


Various non-speech studies cited Houde and Jordan’s work to illustrate 
sensorimotor adaptation in humans. These studies focused on activities 
involving auditory feedback such as piano playing (Pfordresher & Palmer, 
2006); or the learning of artificial auditory-arm movement maps (van 
Vugt & Ostry, 2018). Some papers were also interested in other kinds 
of sensorimotor adaptations such as swallowing (Wong, Domangue, Fels, 
& Ludlow, 2017), or visuomotor adaptation of limb movements (Wei 
et al., 2014). Note that as formant perturbations studies were inspired 
by visuomotor adaptation, they often referred to limb movement litera- 
ture. The converse seems not necessarily true as our research suggests that 
few works on limb adaptation have cited Houde and Jordan’s work. This 
result should be taken cautiously as limb movement research could cite 
other studies using formant perturbations to illustrate the adaptability of 
speech motor control, and we only collected papers that reference Houde 
and Jordan using “cited by” functionality of Google Scholar. 


3.6. Animal studies 


Finally, animal studies have early, and regularly, cited Houde and Jordan’s 
work (Figure 2), with a main focus on the role of auditory feedback in 
action control. Over half of these papers were dedicated to birdsong and 
published by Brainard et al. and/or Doupe et al. and/or Sober et al. Many of 
these papers include studies of birdsong production or learning using audi- 
tory perturbations with behavioral and/or neurophysiologic recordings, as 
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well as interspecies comparative reviews about the processing of auditory 
feedback of self-produced sounds (Brainard & Doupe, 2000; Doupe & 
Kuhl, 1999; Sober & Brainard, 2009). Analogous works were done in 
bats (Smotherman, Zhang, & Metzner, 2003) and primates (Eliades & 
Miller, 2017). 

To summarize, this non-exhaustive analysis of the overall impact of 
Houde and Jordan’s seminal work suggests that it is (as expected) cited by 
papers investigating speech compensation and adaptation to other types 
of sensory perturbations. Most of the scientific questions in this first set of 
papers overlap with the research topics we will review based on the more 
detailed analysis of formant perturbations studies in the related section 
of this chapter. In a broad context, adaptation to formant perturbations 
is often interpreted as evidence for sensorimotor integration and sensori- 
motor plasticity in speech production and perception. It is cited to illustrate 
auditory feedback and feedforward control mechanisms in speech produc- 
tion, as explained below, and taken as an example of such mechanisms 
(and their plasticity) in studies investigating animal vocalizations, singing, 
music playing, but also inter-personal convergence or coordination of 
movements. 

Note that more research topics related to formant perturbation studies 
may be found by including “2nd order” connections to Houde and Jordan’s 
work (i.e. references that cite any of the studies on formant perturbations). 


4. Methods in formant perturbation studies 


In this section, we provide an overview of the apparatuses used to apply 
real-time formant perturbation and a description of the main procedures 
identified in the collected papers. 


4.1. Real-time formant perturbation 


The systems used to shift formants in the collected papers are summarized 
in Table 3. Paper details can be found in Table 4. With regards to formant 
perturbation, it is important to emphasize that in order to preserve the 
best quality of self-perception, the real-time modification of formants in 
speakers’ auditory feedback should meet some requirements, specifically: 
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(1) The signal should be processed and played back fast enough for the 
speaker not to perceive any delay (less than 30ms, see Yates, 1963). 
Specific digital signal processing boards (DSP), including systems from the 
music industry were used, especially in earlier work. Nowadays, this can 
be achieved at a software level, on a PC with appropriate sound card and 
software to analyze and change formants. For the same code, the achieved 
delay can vary depending on the operating system and hardware. 

(2) The parameters of the signal processor should be adapted to the 

speaker and/or to the vowel. This parameterization improves the for- 

mant detection and the reliability of the perturbation. 

Perception of unperturbed feedback (bone conduction and air conduc- 

tion outside the headphones) should be reduced as much as possible. 

Different approaches were used to achieve this aim, such as: 

e Using whispered speech (Houde & Jordan, 1998, 2002) although 

subsequent studies were run with normal speech; 


Qa 


Using closed headphones or insert earphones to reduce the percep- 
tion of the air-conducted signal. The occlusion effect of the head- 
phones on adaptation was recently investigated with no significant 
difference in the magnitude of F1 adaptation between the use of the 
closed Sennheiser “HD 265” and the insert Etymotic Research ER2 
(Mitsuya & Purcell, 2016); 

Increasing the level of the feedback in the headphones, up to 87dB 
SPL (Villacorta et al., 2007); 

e And/or using a masking noise mixed with the played back signal to 

mask bone-conducted speech. 

The shifted vowel should have clearly distinguishable F1 and/or F2 
values, and the shift should be consistent with these values. For this 


s 


reason, the vowel /e/ is chosen in most of the studies as shifting more 
extreme front or back vowels could be limited by overlap in F1-F2 or 
FO-F1 frequencies (Mitsuya, MacDonald, Munhall, & Purcell, 2015), 
and this vowel allows upward and downward perturbations. 


Different research groups have developed their own formant perturba- 
tion systems (Table 3) with four main categories: (1) The two systems 
developed by Houde described with more details in Houde’s PhD (Houde, 
1997) for whispered speech (1.a), and then in Katseff, Houde, & Johnson 
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(2012) for voiced speech (1.b); (2) The system developed and used by 
Munhall, Purcell and collaborators that used a specific hardware; (3) The 
system used by Perkell and Guenther’s teams that first included specific 
hardware (Villacorta et al., 2007) and was then adapted as a free software 
for Matlab. It supports various auditory perturbations, including changes 
in F1 and/or F2, but also more complex ones such as formant trajectory 
perturbations (Cai, Boucek, Ghosh, Guenther, & Perkell, 2008; Tourville, 
Cai, & Guenther, 2013). The last version is called “Audapter” and can 
be download on github.com (https://github.com/shanqing-cai/audapter_ 
matlab, this link was retrieved July, 6, 2018); (4) The last system was 
developed in parallel by three teams: Max et al., Ostry et al., and Shiller 
et al. It uses a device from the music industry (VoiceOne, TC Helicon) that 
by default allows shifting of all the formants while preserving FO. This 
system was used as a way to alter all formants in the same direction (Max 
& Maffett, 2015) or, with supplementary signal processing steps, including 
filtering and mixing, as a way to perturb F1 only (Rochet-Capellan & 
Ostry, 2011). A few papers were dedicated to the presentation and first 


Table 3: Main signal processing systems used in the literature to perturb formants 
in real time (references indicate the publication describing the system) and number 
of papers using the system. 


System 1 System 2 System 3 System 4 
References Houde (1997); Purcell & Villacorta etal. Feng et al. Others 
Katseff et al. (2012) Munhall (2007); Cai etal. (2011); 
(2006ab) (2008); Tourville Rochet- 


e& al. (2013) Capellan & 
Ostry (2011); 
Shum et al. 
(2011) 
Signal 1.a. Whispered National Texas Instruments Electronic Other 
processing speech: Analysis- Instruments C6701 Evaluation speech software 
synthesis PXI-8176 Module DSP processor from or 
process, DSP- 96 embedded board then music industry hardware 
board, Ariel, controller C-extension VoiceOne; solutions — 
Inc. 1.b. Voiced Mex for TC Helicon + 
speech: “Feedback Matlab, opened filters 
Alteration Device” — access — Audapter 
Sinewave synthesis 
Number of 10 23 2 then 20 19 3 


papers 
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evaluation of these different perturbation systems. This was the case with 
Cai et al. (2008) and Tourville et al. (2013) and with the preliminary work 
by Shih, Suemitsu, & Akagi (2011). Two papers also presented a method 
to perturb formants in populations in which speech acoustics have deteri- 
orated, by coupling articulatory synthesis with Audapter (Berry, North, & 
Johnson, 2014; Berry, North, Meyers, & Johnson, 2013). 

As displayed in Table 4, most of the studies involved native speakers of 
English, mainly from North America. Other languages were investigated 
in a few comparative studies or in relation to other research questions as 
described in the next section. Potential generalization of these findings to 
other languages and populations should therefore be taken with caution. 


4.2. Main procedures in formant perturbation 
studies and related concepts 


The main procedures identified in the collected papers about formant 
perturbations are summarized in Figure 3. These procedures will be 
referred to in relation to the research topics detailed in the next section. 
Two main approaches can be distinguished: 


(1) Unexpected formant perturbation during the production of prolonged 
utterances: This first approach was used in only a few of the collected 
papers (n=11, ~14 % of the papers with formant perturbations, see 
Table 4). The perturbation is only applied to a small proportion of 
utterances so that talkers cannot anticipate the perturbation. Moreover, 
the utterances are produced with long vowel duration (steady-state 
vowels) so that corrective answers result from online processing of 
the auditory feedback (cf. Figure 3, procedure P4). This correction is 
called compensation. 

(2) Systematic and constant perturbation over a number of 
utterances: This second approach was used in the majority of the 
papers (n=66, ~86 %, Table 4). The basic procedure is represented 
in Figure 3, procedure P1. It involves the production of utterances 
with “natural” duration, in general. After a baseline with unaltered 
auditory feedback, the perturbation is introduced either gradually 
or abruptly, and then systematically applied at a constant level. 
Depending on the research group, changes in formant production 
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Figure 3: Overview of procedures used in formant perturbations studies. 
Duration of experimental phases and perturbations were variable across studies. 
P1 is the basic procedure to study auditory-motor adaptation, used in Munhall 
et al.’s studies. It was adapted to investigate the transfer of adaptation (P1t) 
(MacDonald, Pile, Dajani, & Munhall, 2008; Rochet-Capellan, Richer, & Ostry, 
2012) and the effect of auditory motor adaptation on perception (P1p) (Lametti, 
Rochet-Capellan, Neufeld, Shiller, & Ostry, 2014) or the effect of perceptual 
training on sensorimotor adaptation (Lametti, Krol, Shiller, & Ostry, 2014). 

P2 is the procedure used in Houde & Jordan (1998) and then by Perkell et al. 
(Villacorta, Perkell, & Guenther, 2007). It is structured in epochs with training 
words produced with feedback followed by training words and generalization 
words produced with a masking noise. P3 is the multiple perturbation procedure 
developed in Rochet-Capellan & Ostry (2011), during which words are produced 
in random order with specific perturbation associated with each word. P4 is 

the compensation procedure to unpredictable perturbations. In this last case, 
long steady-state vowels are produced and the perturbation is introduced 
randomly for a small proportion of utterances to assess online correction (Purcell 
& Munhall, 2006b). Grey scale gradient in the ramp phase represents the 
progressive introduction of the shift. 


at the end of the training phase are referred to as compensation (cf. 
Houde & Jordan, 1998; Purcell & Munhall, 2006b) or adaptation 
(cf. Rochet-Capellan, Richer & Ostry, 2012, Martin et al., 2018), 
and residual changes when the feedback is returned to normal after 
training are referred to as adaptation or after-effect, respectively. 
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This procedure was also used to assess generalization (or transfer) of 
adaptation to untrained utterances, either in the course of the training 
phase (Figure 3, procedure P2) or after the training (Figure 3, proce- 
dure P1t), as presented in the next section. 


Hereafter, adaptation will refer to changes observed at the end of the 
training phase in response to a systematic perturbation. Compensation 
will mainly refer to changes in response to unpredictable perturbations but 
will also be used to qualify the direction of adaptive responses (by contrast 
with following responses that go in the same direction as the perturbation). 


5. Research topics tackled with formant perturbations 


In this section, we provide a thematic review of the collected papers that 
included an empirical study of formant perturbation. As much as possible, 
we chose to associate each paper with a main topic but obviously a paper 
could be related to more than one topic. Table 4 provides a list of all the 
cited references and their main associated research topics. 


5.1. Properties of feedback and feedforward control 


Many studies involving formant perturbations are related to the role 
of auditory feedback in speech motor control and distinguish between 
feedback and feedforward control mechanisms. Feedback control is a 
closed-loop system that involves the sensory consequences of the current 
motion. It is regarded as too slow to account for rapid control and rapid 
adjustments observed in fast coordinated actions. Rapidity and adapt- 
ability of motion were identified early on as evidence of a feedforward 
control mechanism by researchers in visuomotor adaptation. The core 
idea is that the brain makes predictions of the sensory consequences of 
its actions based on an efference copy of the motor command (Houde & 
Jordan, 2002). These predictions involve mappings between motor and 
sensory representations also called internal models (Purcell & Munhall, 
2006a) or sensorimotor memories (see Perrier, 2012, for a discussion of 
the nature of internal models in speech). The DIVA (Golfinopoulos et al., 
2010) or the SFC (Houde & Chang, 2015) neurocomputational models of 
speech production assume the existence of both feedback and feedforward 
control networks that involve auditory and somatosensory systems. When 
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the prediction based on internal models does not match the actual sensory 
input, the internal representations are changed to reduce this prediction 
“error” so that future movements performed in similar conditions will be 
accurate. This mechanism is claimed to underlie sensorimotor adaptation. 

In this context, a first subset of studies with formant perturbations was 
designed to “Investigate the nature, level of details, and use of internal 
models in speech production” (Max, Wallace, & Vincent, 2003, p. 1053) 
and to “begin to parameterize the formant feedback system” (MacDonald, 
Goldberg, & Munhall, 2010 p. 1060). The main contribution of these 
studies is to describe the role of auditory feedback in the control of for- 
mant production, and the adaptability of this control. In these papers, 
adaptability is mainly explained or taken as an evidence for feedforward 
internal models. 

To address the properties of adaptation to formant perturbations, 
Houde and Jordan (2002) analyzed in more detail the adaptation phenom- 
enon introduced in Houde and Jordan (1998). The results highlight some 
properties of feedback and feedforward control that were subsequently 
discussed and investigated in later work, involving various types of for- 
mant perturbations and procedures. 

The first observation of Houde and Jordan was that the changes in F1 
and F2 production in talkers’ speech were compensatory responses, in the 
opposite direction to the perturbation. This result has been reproduced 
consistently in later work when between-speaker data are aggregated. 
Individual data suggests that some speakers follow the shift, however. 
For example, in a meta-analysis of their own studies of adaptation to for- 
mant perturbations, MacDonald et al. (2011) found that 26 out of 116 
female speakers followed F1 or F2 shifts when their production of “head” 
was perturbed toward “had”. A possible explanation is that non-adapted 
speakers may not be able to dissociate their own production from the 
auditory feedback (Vaughn & Nasir, 2015). Following the formant shift 
rather than compensating for it was actually the most frequent behaviour 
observed in a preliminary study investigating compensation in Japanese 
speakers to unexpected perturbations of F1, F2 and F3 (Shih et al., 2011). 
Aside from this study, all other published work on formant perturbations 
observed significant compensatory adaptation in acoustic analyses, 
whereas preliminary analyses of articulatory correlates of adaptation are 
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Table 4: List of all the studies related to formant perturbation included in the pre- 
sent review. The first column provides the reference of the article. The 2nd column 
gives the language of participants (Du: Dutch, En: English, Fr: French, Ge: German, 
Ja: Japanese, Ko: Korean, Ma: Mandarin, Ru: Russian, Sp: Spanish). Column 3 
is related to the perturbation systems, which are described in Table 3 (briefly, 
1.a: Houde & Jordan (1998), 1.b. Katseff et al. (2012); 2: Purcell & Munhall, 
(2006a); 3: Audapter and its previous versions; 4: VoiceOne, TC Helicon, 5: Others) 
and column 4 indicates whether an article is mainly dedicated to the description of 
a perturbation system. Each study has been classified into either compensation (to 
unpredictable perturbations, column 5) or adaptation (to sustained perturbations). 
Columns 7 to 14 show whether the article is related to each of the main research 
topics presented in the present review. A cross indicates that the article is cited in 
the corresponding subsection, while a (X) indicates it is not although it is related 
to the topic. 


Properties of feedback and feedforward 
Transfer/Specificity and speech units 
Pathology affecting speech production 
Neural basis of speech motor learning 


Perceptual & phonological categories 
* | Surface effects & speakers’ characteristics 


% | Perception acuity and sensory integration 


E a 
eo 2 
e 
ae g 
a o 2 È g d 
8 % € 3 2 8 E 
g 3S ê g 3 S = > 
5 ® 3 5 e agg = 
E 25 2 8 3 Fs 
eZ Ha FO < 8 a 
Alsius, Mitsuya, Latif, & En 2 X ) 
Munhall, 2017 
Berry, Jaeger, Wiedenhoeft, En 3 X X X 
Bernal, & Johnson, 2014 
Berry, North, & Johnson, 2014 En 3 X 
Berry, North, Meyers, & En 3 X 
Johnson, 2013 
Bourguignon, Baum, & Shiller, En 4 X X 
2014 
Bourguignon, Baum, & Shiller, En 4 X X 
2015 
Bourguignon, Baum, & Shiller, En 4 X X 
2016 
Cai, Beal, Ghosh, Tiede, En 3 X (X) X 


Guenther, & Perkell, 2012 
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Table 4: Continued 


References 


Cai, Boucek, Ghosh, Guenther, 
& Perkell, 2008 

Cai, Ghosh, Guenther, & 
Perkell, 2010 


Cai, Ghosh, Guenther, & 
Perkell, 2011 


Language 


3 


Caudrelier, Perrier, Schwartz, & Fr 


Rochet-Capellan, 2016 


Caudrelier, Perrier, Schwartz, & Fr 


Rochet-Capellan, 2018 
Caudrelier, Schwartz, Perrier, 
Gerber, & Rochet-Capellan, 
2018 


Daliri, Wieland, Cai, Guenther, 


& Chang, 2018 


Fr 


En 


Lametti, Krol, Shiller, & Ostry, En 


2014 

Lametti, Nasir, & Ostry, 2012 
Lametti, Smith, Freidin, & 
Watkins, 2018 

Demopoulos et al., 2018 


Deroche, Nguyen, & Gracco, 
2017 


Dimov, Katseff, & Johnson, 
2012 


Eckey & MacDonald, 2015 
Feng, Gracco, & Max, 2011 
Houde & Jordan, 1998 


En 


| Perturbation System 


>| System description 


Compensation 


| Adaptation 
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Properties of feedback and feedforward 
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Pathology affecting speech production 
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Table 4: Continued 


References 
Language 


Houde & Jordan, 2002 En 
Ito, Coppola, & Ostry, 2016 En 
Katseff & Houde, 2008 En 


Katseff, Houde, & Johnson, En 
2012 


Klein, Eugen; Brunner, Jana; Ru 
Hoole, Phil (sous press) 


Lametti, Rochet-Capellan, En 
Neufeld, Shiller, & Ostry, 2014 


MacDonald & Munhall, 2012 En 
MacDonald, Goldberg, & En 
Munhall, 2010 

MacDonald, Johnson, Forsythe, En 
Plante, & Munhall, 2012 
MacDonald, Pile, Dajani, & En 
Munhall, 2008 


MacDonald, Purcell, & En 
Munhall, 2011 

Martin et al., 2018 Sp 
Max & Maffett, 2015 En 
Max, Wallace, & Vincent, 2003 En 
Mitsuya & Purcell, 2016 En 


Mitsuya, MacDonald, Munhall, En 
& Purcell, 2015 

Mitsuya, MacDonald, Purcell, En 
& Munhall, 2011 

Mitsuya, Munhall, & Purcell, En 
2017 
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Perception acuity and sensory integration 


Perceptual & phonological categories 


Transfer/Specificity and speech units 
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less clear. Max et al. (2003) analyzed acoustic changes to perturbation of 
all formants in the same direction in relation to jaw and tongue movement 
during adaptation. No consistent behaviour were observed in articulatory 
kinematics. Similar results were obtained in a pilot study in one Korean 
speaker with an F2 shift (Neufeld, Purcell, & Van Lieshout, 2013), while 
clearer tongue compensation movements were reported in speakers with 
blindness (Trudeau-Fisette, Tiede, & Ménard, 2017). On the other hand, 
while the majority of studies on adaptation to formant perturbations 
found significant compensatory responses, it was also shown that adap- 
tation vanishes when perturbed feedback is delayed by more than 100ms 
(Max & Maffett, 2015), or is at least largely reduced (Mitsuya, Munhall, 
& Purcell, 2017). 

Houde and Jordan also reported that maximal changes at the end of 
training did not fully compensate for the perturbation. This result was 
systematically reproduced in later studies. As an illustration, in Purcell 
& Munhall (2006a), the maximal adaptation to a 200Hz upward vs. 
downward shift of F1 compensated for about 30 % of the perturbation, 
regardless of the number of repetitions during the hold phase. This also 
suggests that adaptation is a fast process, in agreement with Max et al. 
(2003)’s observation that compensatory responses occurred after only a 
few repetitions. However, a F1 perturbation of at least 60Hz (80Hz on 
average across conditions) was required in Purcell & Munhall (2006a) to 
initiate the compensatory response. Similar thresholds were reported in 
later work, regardless of the delay in the auditory feedback (Mitsuya et al., 
2017) and the occlusion of the headphones (Mitsuya & Purcell, 2016). 
Furthermore, MacDonald et al. (2010) highlighted a linear relationship 
between the magnitude of the perturbation and the magnitude of changes 
in speakers’ utterances for perturbation magnitudes up to +200Hz in F1 
and -250Hz in F2, compensating for 25 % of the perturbation in F1 and 
30 % in F2. With larger perturbations, there was no improvement, and a 
decrease even appeared in response to perturbations larger than 300Hz in 
F1 and larger than 400Hz in F2. Similar limits were observed by Katseff 
and colleagues (Katseff & Houde, 2008; Katseff et al., 2012), as discussed 
in the next section. Comparable adaptations were reported in the meta- 
analysis provided by MacDonald et al. (2011), with an average of 26.5 % 
for F1 and 23.2 % for F2. Moreover, in this last analysis, changes in F1 in 
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speakers’ production weakly correlated with changes in F2, suggesting a 
specific control of the two parameters and the existence of speaker-specific 
strategies. The magnitude of the response was also found to vary according 
to the vowel in pet, bus and law utterances in Max et al. (2003). Further 
work addressing this last point with regard to more specific research topics 
is presented in the next section. 

Houde and Jordan also noticed that inter-speaker variability was not 
related to a speaker’s awareness of the auditory shift. When interviewed 
after the study, talkers reported they were unaware of the perturbation or 
of any change in their production. By contrast, Purcell & Munhall (2006a) 
reported that 40 % of their participants indicated awareness of “some kind 
of change in the auditory feedback over the course of the experiment”, 
with only 8 % noticing that the perturbation transformed the vowel into 
a different one. However, the magnitude of adaptation did not seem to be 
related to the responses in this interview. This difference to Houde and 
Jordan might be related to the abrupt suppression of the perturbation after 
training in Purcell & Munhall (2006a) (Procedure P1, Figure 3) that was 
probably perceived by the speakers, while Houde and Jordan assessed how 
adaptation was sustained using catch trials with masking noise (Procedure 
P2, Figure 3). Munhall, MacDonald, Byrne, & Johnsrude (2009) then con- 
firmed that the awareness of the perturbation does not influence adaptive 
behavior, as discussed later in the “Surface effects & speakers’ character- 
istics” subsection. 

Another important result in Houde and Jordan was that changes for 
perturbed utterances were larger than changes for utterances produced 
with a masking noise. The authors discussed this result as evidence that 
“vowel production could be partly under immediate auditory feedback 
control” (Houde & Jordan, 2002, p. 307). By contrast, in their prelimi- 
nary study of adaptation to a shift of all formants in the same direction, 
Max et al., (2003) argued that the modifications in talkers’ production 
should be considered as adaptive responses rather than reactive changes, 
as they already occur at vowel onset, and have been observed for sustained 
vowels as well as vowels with shorter duration. The variability of changes 
in formants according to the vowel’s parts were not systematically investi- 
gated in adaptation studies as most of the studies used a single steady-state 
value, often around the middle of the vowel. However, in their preliminary 
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work, Berry, Jaeger, Wiedenhoeft, Bernal, & Johnson (2014) suggested 
that this single value might not be the most appropriate, depending on 
consonant context and coarticulatory effects. Vaughn and Nasir (2015) 
also provided evidence that full trajectory analysis might better capture 
adaptation phenomena. The relationship between formant values in con- 
secutive trials (as measured with one-lag cross correlation analyses), in the 
absence of any perturbation, may also be predictive of adaptation mag- 
nitude (Purcell & Munhall, 2006a). Altogether, these results suggest that 
changes observed over the course of adaptation to a perturbation result 
probably from a mix of feedback and feedforward control. 

Houde and Jordan (2002) suggested investigating compensation to 
formant perturbations in steady-state vowels to determine the role of 
online feedback in formant control. Studies focusing on compensation 
to an unexpected formant perturbation in sustained vowels usually ana- 
lyzed changes at different points of the vowel. For instance, in Purcell and 
Munhall (2006b) upward vs. downward perturbations of F1 were applied 
randomly in five utterances of “head” over 100 utterances of different 
CVC words. Results show partial compensation, with on average, 16.3 % 
vs. 10.6 % of the upward vs. downward shifts, but with high variability 
for the same talker between utterances and between talkers. However, this 
study was not designed to measure the delay in compensatory response. 
This delay was found in later studies to be around 160ms, at least when 
F1 is shifted upward (e.g. Tourville, Reilly, & Guenther, 2008), and when 
more complex spatial or temporal perturbations of formants trajecto- 
ries are applied during the production of short sentences (Cai, Ghosh, 
Guenther, & Perkell, 2011). The smaller compensation of perturbation 
observed in studies involving unexpected perturbation compared to studies 
involving systematic perturbation, as well as the delay required to observe 
a compensatory response, confirm the idea that responses produced in the 
presence of the perturbation in adaptation studies are at least partially 
adaptive. 

One of the most intriguing outcomes of Houde and Jordan (2002) was 
that the modification in formants was still present when talkers came 
back a month later to run a control study evaluating changes in produc- 
tion without perturbation. This long-term effect was attributed by the 
authors to implicit memory of the task or specific control mechanisms for 
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whispered speech. Although not reproduced in later work — as there was 
no study with equivalent long-term assessment in our review at least — 
Purcell and Munhall, (2006a) showed that 115 repetitions without per- 
turbation after the training phase were not enough to fully return to the 
baseline state. The explanation introduced by Houde and Jordan echoes 
the idea that auditory-motor learning could be specific to some situations 
or ways of speaking, as discussed in generalization studies. The ability to 
memorize specific ways of speaking according to the situation could be 
a way to support fast speech adaptability in known situations. This idea 
could be further investigated by means of transfer of adaptation from one 
context to another as discussed below. 

Finally, large inter-speaker variability was also pointed out in Houde and 
Jordan (2002) and then observed in all the subsequent studies. MacDonald 
et al. (2010, 2011) suggested that this variability is not clearly related to 
the variability in baseline production, nor to the size of the vowel space. 
Inter-speaker variability, as well as partial compensation, in formant adap- 
tation studies was often discussed in terms of a tradeoff between auditory 
and somatosensory feedback. For example, Purcell and Munhall (2006a) 
suggested that “Some [speakers] may rely more on kinesthetic feedback 
and thus are not influenced as much by acoustic feedback” (p. 975), while 
Houde and Jordan (2002) suggested “it may be that there are differences 
across participants as to the degree to which they rely on auditory feed- 
back” (p. 308). The tradeoff between auditory and somatosensory feed- 
back, as well as the role of sensory acuity in adaptation was then explored 
in several papers, as described in the next section. 


5.2. Perception acuity and sensory integration 


Formant perturbations’ paradigms involve modifying the auditory feed- 
back, i.e. sensory input of speech control system, and measuring the 
outcome in terms of speech production, or motor control. Hence these 
paradigms are by nature relevant to the question of the relationship 
between perception and production. Several aspects of this relationship 
have been investigated over the past two decades. 

First, adaptation to auditory perturbations may be influenced by 
speakers’ sensory acuity. Auditory acuity has been positively corre- 
lated with adaptation magnitude in two studies (Martin et al., 2018; 
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Villacorta et al., 2007) involving 13 and 31 subjects respectively. 
Auditory acuity measurements were based on discrimination tasks in 
both cases. Villacorta et al. focused on acuity for F1 while Martin et al. 
measured acuity based on pitch and loudness, as well as melody discrim- 
ination tasks. A possible interpretation of the relation between adap- 
tation magnitude and auditory acuity is that better acuity could lead 
speakers to have smaller goal regions for their production, resulting 
in higher adaptation (Villacorta et al., 2007). However, auditory feed- 
back may not be the only feedback used to control speech production. 
Feng et al. (2011) investigated the relationship between the adaptation 
magnitude of F1 and the auditory acuity for F1, as well as somatosen- 
sory acuity for jaw position. They did not find a reliable correlation. 
However, fewer subjects were involved in this study than in previously 
cited ones (8 subjects vs. 13 and 31). 

Feng et al. also combined a somatosensory perturbation induced by a 
robotic device pulling the jaw, with an auditory shift on F1. Using this 
procedure, they found that speakers mainly compensated for the audi- 
tory perturbation. They suggested that auditory feedback may be dom- 
inant over somatosensory input, but that their relative weight could 
evolve with speech experience. Using similar methods, Lametti, Nasir, and 
Ostry (2012) found that all speakers adapted for at least one of the two 
perturbations. The group who adapted to the somatosensory perturbation 
(half of the participants) did not significantly compensate for the audi- 
tory perturbation while the group that did not adapt to the jaw perturba- 
tion significantly compensated for the F1 shift. This observation suggests 
a speaker-specific sensory preference for either auditory or somatosensory 
inputs. In addition, the weights attributed to auditory and somatosensory 
feedback may vary according to the articulator (i.e. vocal folds, tongue 
or jaw) to control. Indeed, no correlation has been found in the magni- 
tude of adaptation in FO, F1 and F2 across speakers while altering them 
simultaneously or separately (Eckey & MacDonald, 2015; MacDonald & 
Munhall, 2012). Interestingly, Trudeau-Fisette et al. (2017) showed that 
speakers with blindness adapted more to an F2 shift than control speakers, 
independently of their auditory acuity, and that they also produced larger 
articulatory changes in response to the auditory shift. Speakers with 
blindness may rely more on auditory feedback than control speakers, who 
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may have more precise somatosensory goals, probably built and supported 
by visual perception of speech. 

However, sensory preference in the control of speech, which can be 
modeled by different weights attributed to each kind of sensory feed- 
back, may also evolve with experience. Most studies on auditory-motor 
adaptation report a partial compensation for the auditory perturba- 
tion as already mentioned in the previous section. Some studies showed 
that the percentage of compensation relative to the magnitude of 
the perturbation decreases when the magnitude of the perturbation 
increases, reaching an asymptote, and can even tend to decrease for 
larger perturbations (Katseff & Houde, 2008; Katseff et al., 2012; 
MacDonald et al., 2010). Katseff et al. (2012) interpreted this phe- 
nomenon as evidence that the weights attributed to auditory and 
somatosensory feedback may vary according to experience: “For small 
discrepancies between auditory and somatosensory feedback, auditory 
feedback takes precedence, and for large discrepancies between audi- 
tory and somatosensory feedback, somatosensory feedback takes prece- 
dence” p. 307. Thus, a high-amplitude shift may lead the speech system 
to consider auditory feedback as unreliable and therefore give more 
weight to somatosensory feedback. In addition, the relative impor- 
tance of sensory input may depend on the specific sounds produced. 
Several studies observed less compensation in closed vowels than in 
open vowels (Mitsuya et al., 2015; Purcell & Munhall, 2008; Reilly & 
Dougherty, 2013). This could be explained by better-specified somato- 
sensory information in the former than in the latter case (Mitsuya 
et al., 2015). Another possible explanation is that the importance of F1 
as an acoustic cue in perception may depend upon the vowel (Reilly & 
Dougherty, 2013). 


5.3. Perceptual and phonological categories 


Speech perceptual space is structured by phonological categories, which 
are delimited by perceptual boundaries. Niziolek and Guenther (2013) 
showed an effect of perceptual boundaries on the magnitude of compen- 
sation to unpredictable auditory perturbations. They observed that if the 
auditory signal resulting from the perturbation is near a boundary, the 
compensation, as well as the cortical activation, is higher than when it 
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is far from a boundary, the magnitude of the shift being equal. In ad- 
dition, various studies have investigated the relation between perceptual 
boundary and adaptation to sustained auditory perturbations. 

The influence of perceptual boundaries on adaptation can be inves- 
tigated using perceptual learning on the perceptual contrast that is at 
stake in the adaptation paradigm. For instance, Shiller, Lametti, & Ostry 
(2013) manipulated speakers’ perceptual boundaries between “head” 
and “had” through perceptual training preceding auditory-motor adap- 
tation to a perturbation consisting of altering “head” into “had” (see 
Procedure P1p on Figure 3). The group whose boundary was shifted 
towards “head” (i.e. who was more likely to classify ambiguous stimuli 
as “had”) adapted more to the auditory perturbation than the group 
whose boundary was shifted towards “had” by the perceptual training. 
Similarly, children adapted more to a perturbation transforming /beb/ 
into /bab/ after a perceptual training manipulating /e/-/e/ boundary 
towards /e/ than before training (Shiller & Rochon, 2014). They also 
adapted more than children having undergone a perceptual training on 
an unrelated contrast. Furthermore, Lametti, Krol, et al. (2014) observed 
in adults that the amount of adaptation to auditory-feedback pertur- 
bation was correlated with the position of the perceptual boundary 
obtained through perceptual training. 

Instead of using perceptual training, changes in perceptual boundaries 
were obtained by manipulating the pitch and formant of the carrier phrase 
“please say what this word is...” (Bourguignon, Baum, & Shiller, 2015, 
2016). In this study, the group exposed to high carrier-phrase (high pitch 
and formants) had the boundary between ‘bit’ and ‘bet’ shifted toward 
‘bet’. They adapted more to an auditory feedback alteration transforming 
/e/ into /1/ than the speakers exposed to low carrier-phrase (low pitch and 
formants). This finding suggests that “context-dependent plasticity in 
speech perception may also transfer to production” (Bourguignon et al., 
2016, p. 1040). Interestingly, Bourguignon, Baum, and Shiller (2014) 
also showed an effect of the lexical status that can be interpreted in terms 
of perceptual boundaries. In their study, a group of speakers produced 
pseudo-words that resulted in real word when auditory perturbation was 
applied (e.g. “kess” changed into “kiss”). Another group produced real 
words that were transformed into pseudo-words by the same formant shift 
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(e.g. “less” changed into “liss”). The first group showed greater adapta- 
tion than the second group, indicating a lexical effect on auditory-motor 
adaptation. 

The influence of phoneme categories on speech motor adaptation was 
also highlighted in cross-language studies. Mitsuya, MacDonald, Purcell, 
and Munhall (2011) contrasted the adaptation to upward and down- 
ward shifts in F1 in three groups: English speakers pronouncing “head”, 
Japanese speakers producing the Japanese word /he/ and Japanese speakers 
learning English, producing “head”. The magnitude of adaptation was 
equivalent in all groups in response to the downward shift, but the adap- 
tation was smaller in Japanese than in English speakers in response to the 
upward shift. This difference is evidence for the influence of the phonolog- 
ical system in adaptation. Mitsuya, Samson, Ménard, and Munhall (2013) 
also showed differences between English speakers and French speakers in 
the adaptive response to the same auditory perturbation. In this study, a 
perception test suggested that this language effect on adaptation was medi- 
ated by a difference in perceptual boundaries: larger adaptation in French 
speakers was related to greater sensitivity to some phonetic contrasts. 

Reciprocally, the influence of adaptation on perceptual boundaries has 
also been investigated. Lametti et al. (2014) incorporated perceptual tests 
in a classic auditory-motor procedure (Figure 3, procedure P1p), before 
and after the training phase — during which adaptation occurs — as well as 
after the after-effect phase, used here as a wash-out of adaptation. They 
observed that auditory-motor adaptation resulted in a shift of a percep- 
tual boundary in the phonetic range of what speakers produced but not 
what speakers heard. For instance, speakers who produced “head” and 
heard an auditory feedback shifted toward “had”, compensated by pro- 
ducing an utterance closer to “hid”. Their perceptual boundary between 
“head” and “hid” was shifted toward “head”, that is, speakers became 
more likely to report hearing ‘hid’ in the perceptual test, while there was 
no effect on the perceived boundary between “head” and “had”. This 
result suggested that the change in perception was specifically driven by 
speech motor adaptation and not by the auditory input during learning. 
The interpretation of these results, together with the results of other 
studies on the effect of auditory-motor adaptation on categorical percep- 
tion, was recently specified in a Bayesian modeling framework, suggesting 


Changes in speech production in response to formant perturbations 45 


that speech motor adaptation results both in speech sound remapping and 
changes in phoneme categories (Patri, Perrier, Schwartz, & Diard, 2018). 
Yet, using a similar paradigm to that of Lametti et al. (2014), Schuerman, 
Meyer, and McQueen (2017) did not find significant influence of auditory- 
motor adaptation on related perceptual boundaries. It should be noted 
that this experiment had fewer subjects than Lametti et al. (2014); was 
run with speakers of Dutch as opposed to English; and used a continuum 
with isolated vowels rather than a continuum between words, during the 
perceptual test. However, this last study also recorded EEG signals during 
initial vs. final perception tests. The analysis of ERPs to the stimuli of the 
/e/-// continuum revealed changes in N1 and P2 components for ambig- 
uous stimuli, which correlated with the magnitude of adaptation as mea- 
sured by F1. The effect on both N1 and P2 suggest that auditory-motor 
adaptation influences both early perception and late perceptual decisions. 
Interestingly, Schuerman, Nagarajan, and Houde (2015) and Schuerman 
et al. (2017) showed that the adaptation to an auditory perturbation of 
F2 shifting the front vowel /i/ towards the back-vowel /u/ resulted in a 


ce 


shift in the perceptual boundary between “see” and “she”. More spe- 
cifically, the shift in perceptual boundaries depended on the behavior of 
speakers during the adaptation task: speakers who followed the auditory 
perturbation had their perceptual boundary shifted in the opposite direc- 
tion to that of speakers who compensated for the auditory feedback. This 
last group was more likely to categorize ambiguous stimuli as “see” than 
“she”, the place of articulation of the consonant /s/ being more anterior 
than /f/. These findings are in agreement with the idea that some transfer 
of adaptation may occur between vowels and consonants articulated with 
a similar tongue position. 

While this impact of a change in production on the perception of 
another contrast is actually a transfer from production to perception, the 
term transfer is typically investigated in speech production itself, from one 
utterance to another. 


5.4. Transfer/specificity and speech units 


In the limb movement literature, generalization of motor learning is the 
“ability to correctly extrapolate to contexts that are different from our 
limited experience” (Krakauer, Mazzoni, Ghazizadeh, Ravindran, & 
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Shadmehr, 2006, p. 1798). This extrapolation could be the result of an inter- 
polation of previous experiences (Mattar & Ostry, 2007). Generalization 
has been extensively investigated in motor learning research, and in 
speech, in particular to address the specificity of motor adaptation and the 
underlying representations (Tremblay, Houle, & Ostry, 2008). Transfer 
of adaptation is usually defined as a positive generalization, as opposed to 
interference (Krakauer et al., 2006). However, we will use generalization 
or transfer to designate changes observed in untrained utterances after 
adaptive training, going in the same direction as adaptation. When no sig- 
nificant transfer is observed, changes related to adaptation are considered 
to be specific to the training utterance. 

The investigation of generalization or transfer of adaptation relied on 
two different motivations. The first set of work focused on generalization 
as a way to assess the global vs. specific nature of auditory-motor map- 
ping. This approach is derived from limb movement studies that analyzed 
generalization of visuomotor adaptation to address the global vs. specific 
nature of visuomotor mapping. The second set of work, that is sometimes 
an extension of the first one, considered generalization of auditory-motor 
learning as a way to assess the nature of speech production units, by 
questioning the linguistic level of auditory-motor mapping. This second 
approach was introduced by Houde and Jordan (1998) and is consistent 
with earlier work on transfer of perceptual learning to assess speech per- 
ception units (e.g. Chambers et al., 2010). 

Different procedures were used to investigate generalization of auditory- 
motor adaptation. The first one is structured in “epochs” (Figure 3, P2). 
Each epoch includes utterances with feedback on and utterances with a 
masking noise, which can be either the training utterances or different 
utterances. Transfer is evaluated at the end of the training phase, when 
the perturbation is maximal by measuring changes in transfer (or test) 
utterances as compared with their baseline. Using this procedure, Houde 
and Jordan (1998) found significant transfer from the training words 
sharing the same vowel /e/ (“pep”, “peb”, “bep”, and “beb”), shifted 
toward /1/ or /z/ to the various test words (same vowel as training words — 
“gep”, “peg”, “teg”, or different vowels — “pip”, and “pap”). The amount 
of transfer was variable depending on the test word, but not statistically 
different. Consistent results were reported in Villacorta et al. (2007), 
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where adaptation on the vowel /e/ for nine CVC words to an F1 pertur- 
bation, significantly generalized to the same vowel in different CVC or 
to the vowels in “pit”, “pat”, and “pot”. Results were less consistent for 
“put” and “pete” and seemed to depend on the direction of the perturba- 
tion. Still with a similar procedure, but with perturbation of F1 trajectory 
in speakers of Mandarin, Cai et al., (2008) and Cai et al. (2010) found 
gradients of generalization that depended on the similarity in formant tra- 
jectory between a training triphthong and the tested utterances. Finally, 
Reilly and Pettibone (2017) tested generalization from the vowels /i/ vs. 
/æ/ (embedded in a set of CVC utterances) to /i/, /e/ and /z/ (also in CVC) 
produced with a masking noise. In both training conditions, /e/ was the 
“near” vowel in test utterances, while /i/ and /æ/ were either the same as 
the training vowel or the “far” vowels, depending on the training condi- 
tion. Adapted speakers exhibited significant generalization to all vowels, 
regardless the training vowel. However, correlation between adaptation 
and generalization were unclear suggesting that generalization may depend 
on multiple factors and may be sensitive to inter-speaker variability. 
Similar procedures, mixing training and transfer trials, were used in 
limb movement studies. However, the approach was later criticized. In par- 
ticular, with this procedure, “the patterns of generalization observed are 
difficult to interpret, as transfer could reflect an averaging that takes places 
when subjects experience several training conditions simultaneously” 
(Rochet-Capellan et al., 2012 p. 1711). For this reason, other studies 
tested transfer after the training phase, when the feedback is turned-back 
to normal (Procedure P1t on Figure 3). In preliminary work, MacDonald 
et al. (2008) compared transfer tested in the course of training vs. after 
training. In both cases, speakers were trained on “head” shifted towards 
“had” and transfer was tested on the production of “hid” with unaltered 
feedback. When the transfer utterance “hid” was inserted during training, 
changes in “hid” were observed at the beginning of the training phase, but 
then its production came back to baseline. When tested after training, no 
change was observed at all in “hid”. Overall this suggests that adaptation 
is specific to the trained vowel, although it slightly depends on the training 
conditions. Pile, Dajani, Purcell, and Munhall (2007) then observed sim- 
ilar adaptation and lack of generalization toward “hid” or “hayed” (i.e. / 
hed/). Both studies were published in proceedings and were preliminary, 
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with restricted analyses. Later work by Rochet-Capellan et al. (2012) 
evaluated how adaptation to a perturbation of F1 in /pen/, /ben/, /ken/, / 
gen/, /ten/, /den/, /pan/, /pin/ then affect the production of /pen/ produced 
without perturbation. Results were consistent with previous work that 
tested generalization with a mixed procedure (Figure 3, P2): generaliza- 
tion was variable according to the training word and seemed to depend 
on the acoustical proximity between the training and the testing utterance. 
Another important result of this work was that the after-effect, assessed 
on the training utterance after the transfer phase, was still significant, 
suggesting that the production of the transfer utterance with normal feed- 
back did not wash out adaptation. This last result is consistent with the 
idea that learning is related to the training experience, and at least to some 
extent specific to this experience. 

Another way to assess specificity of adaptation is to evaluate how 
speakers can specifically compensate for several perturbations in the same 
training session. This approach is inspired by limb movement studies 
and in particular Osu, Hirai, Yoshioka, and Kawato (2004). In Rochet- 
Capellan and Ostry (2011), speakers produced “head” and “had” in 
random order with F1 shifted downward in “head” and upward in “had” 
and conversely (Procedure P3 in Figure 3). On average, speakers were 
able to change F1 frequency in opposite directions for “head” and “had”, 
suggesting that auditory-motor mapping is specific to each vowel. To as- 
sess whether auditory-motor mapping could be specific to a word, the 
authors then evaluated multiple adaptations for “head” and “bed” shifted 
in opposite directions and “ted” un-shifted. Again, on average, specific 
adaptation in opposite directions were observed for “head” and “bed” 
while F1 in “ted” remained unchanged, suggesting that different auditory- 
motor mappings could be built for a same vowel in different words. Similar 
results were obtained recently by Klein, Brunner and Hoole (in this book) 
with a Russian vowel in /d/ vs. /g/ CV syllables and a perturbation of F2. 
The authors also provided analysis of speakers’ data showing symmetrical 
vs. asymmetrical profiles of adaptation. 

Altogether, these results suggest that generalization of auditory motor 
adaptation occurs in a way that depends on the similarity between the 
training and the testing utterance and that specific control can be achieved, 
at least under specific conditions. The results were interpreted as an 
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indication of global control for vowel production vs. specific control. 
Furthermore, generalization from a vowel to the same vowel in different 
contexts suggests that auditory-motor mapping could occur at the level of 
the phoneme. It is thus a way to question the structure of feedforward map- 
ping, and the nature of its underlying representations (Houde & Jordan, 
1998). The fact that transfer is in general smaller than after-effect suggests 
that word context may play a role. The idea that multiple representations 
may coexist in auditory-motor mapping of speech was directly assessed in 
recent papers by Caudrelier et al. (Caudrelier et al., 2016; Caudrelier et al., 
2018). In this work, several linguistic levels were contrasted by assessing 
transfer on test utterances that shared either the same vowel, and/or the 
same syllable or was the same word as the training utterance. Transfer was 
smaller (although significant) at the vowel level than transfer to the same 
syllable, which was lower than after-effect in the same word, suggesting 
that these three levels — words, syllables, phonemes —- could coexist in par- 
allel in the structure of the speech sound map. This conclusion is con- 
sistent with multiple traces connectionist models of long-term memory 
(Ans, Carbonnel, & Valdois, 1998; Carbonnel, Charnallet, & Moreaud, 
2010) in the sense that multiple units could emerge as common informa- 
tion of multiple experiences (Goldinger, 1998; Hintzman, 1986). Specific 
production of the vowel to the syllable or word context also questions the 
role of coarticulation in adaptation and transfer of adaptation, a topic 
introduced in a preliminary paper by Berry et al. (2014). 

In addition to the theoretical insights mentioned above, a better under- 
standing of generalization in speech may have clinical implications in 
speech rehabilitation (e.g. after stroke), since transfer from training with a 
speech therapist to daily life is essential (Aichert & Ziegler, 2013). Other 
clinical applications are described in the next section. 


5.5. Pathology affecting speech production 


Auditory feedback perturbation paradigms may be instrumental in the 
understanding of mechanisms underlying disorders related to or af- 
fecting speech production. In particular, low compensation or adaptation 
observed in patients with a given pathology is regarded as evidence for a 
lack of sensorimotor integration or as an impairment of feedforward con- 
trol mechanisms. 
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Stuttering is suspected to be driven by abnormal integration of sen- 
sory input in speech motor control, and has been an early target for 
auditory perturbation studies, and more recently for studies using for- 
mant perturbations. Cai et al. (2012) observed smaller compensation to 
unpredictable perturbation of formants in persons who stutter compared 
to control participants. The latency of compensation was however found 
to be equivalent in both groups. According to the authors, this suggests 
impairment of the inverse model responsible for translating auditory error 
detection into proper correction in motor commands. Reduced responses 
to formant perturbations were also observed in adaptation studies, with 
systematic perturbations. Sengupta, Shah, Gore, Loucks, & Nasir (2016) 
found smaller adaptation in adults who stutter as compared with con- 
trol speakers that was also related to anomalous EEG phase coherence. 
This hints at a miscommunication between speech sensory and motor 
areas, which confirms a potential deficit in sensorimotor integration in 
people who stutter. A recent study, Daliri, Wieland, Cai, Guenther, & 
Chang (2018) also found reduced adaptation in adults who stutter com- 
pared to control speakers. However, the difference was not observed in 
children who stutter as compared with their aged-match controls. These 
results suggest that reduced adaptation observed in adults may be a con- 
sequence of compensatory strategies induced by the pathology rather than 
a root cause. 

Terband, Van Brenk, and van Doornik-van der Zee (2014) used a sim- 
ilar adaptation paradigm as Daliri et al. (2018) with children with CAS 
(Childhood Apraxia of Speech). CAS was described as “a disordered devel- 
opment of the functional synergies/coordinative structures that underlie 
speech motor coordination causing impairment of the forward model 
leading to poor feedforward control” (Terband et al., 2014, p. 66). In 
agreement with this description, children with CAS were shown to follow 
the auditory perturbation on average, while their aged-match controls 
adapted to the perturbation by compensating for it. 

Van den Bunt et al. (2017) used formant adaptation to assess the nature 
of the phonological deficit observed in dyslexia, known as a “difficulty in 
acquiring fluent word-decoding skills” (p. 1). Adults with dyslexia showed 
greater adaptation and after-effects than control speakers to a formant 
feedback perturbation that doesn’t cross a phonemic boundary (i.e. an 
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allophonic perturbation). Moreover, a negative correlation was observed 
between reading skills and the magnitude of adaptation: the worse the 
reading score, the larger the adaptation. This result could be interpreted 
as a weaker perceptual magnet effect (Kuhl et al., 2008) in speakers with 
dyslexia and supports theories claiming that dyslexia is associated with a 
greater distinction between allophones, which may lead phoneme catego- 
ries to be less prominent. However, a condition with a perturbation crossing 
the phonetic boundary is required to further support this hypothesis. 
Compensation or adaptation to formant perturbations were also inves- 
tigated in populations with neurogenetic or neurodegenerative diseases. 
Demopoulos et al. (2018) used adaptation to formant perturbation to 
address the origin of the speech production deficit observed in young 
individuals with a subtype of autism (due to a 16p11.2 deletion). The 
adaptation was reduced in this population as compared with age-matched 
controls while compensation to unexpected perturbation of FO was larger. 
According to the authors, this suggests that feedforward models could 
be altered in people with 16p11.2 deletion, leading to an over-reliance 
on feedback control. A comparable profile of larger compensation to 
unexpected perturbation of FO was observed in patients with Parkinson 
Disease (PD). However, both compensation and adaptation to unexpected 
vs. constant formant perturbation were reduced in speakers with PD as 
compared with age-matched control speakers (Mollaei, Shiller, Baum, & 
Gracco, 2016; Mollaei, Shiller, & Gracco, 2013). The authors interpreted 
the difference in pitch and formant compensation in terms of somatosen- 
sory and muscle activation deficits of the larynx and oral cavity. This dis- 
sociation between compensation to FO vs. formant perturbations calls into 
question the conclusion of Demopoulos et al. (2018): as feedback control 
was only assessed with FO in speakers with 16p11.2 deletion, it remains 
unclear whether they indeed rely more on feedback control in general or 
if the effect was specific to FO control. Finally, Parrell, Agnew, Nagarajan, 
Houde, & Ivry (2017) found that speakers with cerebellum degeneration 
compensate for unexpected formant perturbations more than their age- 
matched controls, while they show weaker adaptation to sustained per- 
turbation. This suggests that the cerebellum plays an important role in 
feedforward control, and probably less in feedback control. The involve- 
ment of the cerebellum in feedback control is discussed in the next section. 
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5.6. Neural basis of speech motor learning 


The neural correlates of speech motor control and learning have been 
investigated through a variety of techniques, including EEG, fMRI, rTMS 
and tDCS. 

fMRI is not suitable to observe changes in the timeframe of adaptation to 
sustained perturbation because it could be confounded with low-frequency 
noise observed in fMRI (Zheng et al., (2013). However, it is feasible to inves- 
tigate the neural networks involved in feedback control using unpredictable 
perturbations. In Tourville et al. (2008), trials under altered auditory feed- 
back (as opposed to normal feedback) were associated with increased bilat- 
eral activation in posterior auditory cortex (including posterior Superior 
Temporal Gyrus, pSTG, and Planum Temporale, PT). This observation is 
regarded as evidence for the existence of auditory error cells, dedicated to 
detect errors in auditory feedback. The increased activation in right pSTG 
was observed to be enhanced when auditory perturbation outcomes were 
close to a perceptual boundary. In addition, Tourville et al., (2008) found 
increased right activation in ventral Motor and Premotor Cortex (vMC and 
vPMC, respectively) and anterior medial cerebellum (amCB). This suggests 
that feedback control involves mainly the right hemisphere whereas the left 
hemisphere, which is known to be dominant in speech production, would be 
mainly associated with feedforward control. Zheng et al., (2013) conducted 
further {MRI investigation. Their experimental procedure consisted of pro- 
duction trials with normal feedback, altered feedback (with F1 shift) and 
feedback with masking noise. Speakers then passively listened to every 
signal corresponding to their auditory feedback in the production session. 
Combining fMRI with an analysis of neural pattern similarity analysis 
enabled differentiation of three functional networks: an error signal net- 
work (including right AG, right SMA, and bilateral cerebellum), a passive 
listening network, and a network responding to both production and pas- 
sive listening conditions, that may correspond to sensorimotor integration, 
located in bilateral Inferior Frontal Gyrus (IFG). 

The Inferior Parietal Lobe (IPL), which comprises Supramarginal 
Gyrus (SMG) and Angular Gyrus (AG) may be involved in multisensory 
integration. An rTMS stimulation applied over the SMG just before the 
auditory-motor adaptation procedure reduced adaptation responses in 
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comparison with a sham stimulated group (Shum et al., 2011). Similarly, 
a tDCS stimulation applied over IPL affected auditory-motor adaptation 
(Deroche, Nguyen, & Gracco, 2017). More specifically, anodal stimula- 
tion aiming at facilitating neuronal excitability resulted in stronger adap- 
tation magnitude whereas cathodal stimulation, which has an inhibitory 
effect, prevents auditory-motor adaptation to predictable perturbations. 
Lametti, Smith, Freidin, and Watkins (2018) investigated the specific role 
of two areas involved in motor control, the cerebellum and the premotor 
cortex. In this experiment, anodal tDCS was applied during the baseline 
phase and the training. The auditory perturbation consists of an F1 shift 
making the training words “bed”, “head” and “dead” sound more like 
“bad”, “had” or “dad”, respectively. Stimulations over either motor cortex 
or cerebellum were both found to lead to higher adaptation and/or after- 
effect than in the sham-stimulated group. Interestingly, stimulation over the 
cerebellum increased error compensation on F1, while stimulation of the 
motor cortex also led to adaptation in F2. Adaptation in F2 when altering 
F1 only has been reported for the front vowel /e/ with variable size-effects 
(MacDonald et al., 2011; Rochet-Capellan & Ostry, 2011; Villacorta et al., 
2007). Changing F2 in answer to a perturbation of F1 may be a strategy to 
reach an appropriate phoneme auditory category, as F1 and F2 vary at the 
same time in the contrast of front vowels. Thus, the cerebellum is suggested 
to contribute to error correction only, while motor cortex may lead to more 
general adaptation, possibly related to previously learnt movements. 
While rTMS and tDCS can reveal the functional role of a specific 
brain area, neuronal oscillations as observed in EEG combined with 
phase coherence analysis may provide insights into the communication 
between brain areas as proposed by Sengupta and Nasir (2015). Phase 
coherence over a specific brain area can also represent a measure of this 
area’s engagement. In this study, a redistribution of phase coherence in 
specific frequency bands (theta and gamma bands) occurred at the end of 
the training phase and was related to the amount of speakers’ adaptation. 
This phenomenon was interpreted as a sign of the establishment of a new 
feedforward map (i.e. associating an auditory target to a motor gesture 
that enables the speaker to reach it) together with increased engagement 
of sensorimotor areas. Sengupta and Nasir (2016) then found that by late 
training, power in specific frequency bands during speech planning and 
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speech production was related to whether speakers were adapting to the 
auditory perturbation or not. Finally, Sato and Shiller (2018) analyzed 
event-related potentials (ERPs) during adaptation to an increase of F1. 
They observed that electro-cortical potentials at certain temporal windows 
(N1, P2) amplitude mirrors adaptation, as larger adaptation magnitude 
correlated with smaller N1/P2 amplitude. This larger speaking-induced 
suppression with learning was interpreted as an indication of auditory pre- 
diction during speaking. 


5.7. Speech development 


Auditory perturbation is an artificial way to generate speech learning, which 
otherwise occurs in natural situations: learning a new language, as well as 
during the development of speech. Studying adaptation to perturbations 
in typical adult speakers might help understand potential mechanisms 
occurring in these natural situations. It also questions the way children 
learn speech sounds. Daliri et al. (2018) and Terband et al. (2014) studied 
adaptation in atypical development, as reported in the “Pathology” section. 
Shiller and Rochon (2014) investigated the relation between adaptation on 
perceptual boundaries in children, as reported in the “Perceptual and pho- 
nological categories” section. MacDonald, Johnson, Forsythe, Plante, and 
Munhall (2012) and Terband and Van Brenk (2015) focused on adaptation 
in typically developing children at different ages. Terband and Van Brenk 
(2015) found greater adaptation in 4 to 9-year-old children than in adults, 
although the magnitude of adaptation did not correlate with age in the 
group of children, and the proportion of children exhibiting a consistent 
compensatory response was lower than in adults. MacDonald et al. (2012) 
showed that 4-year-old children adapted to a sustained perturbation with a 
similar magnitude of adaptation as adults, whereas 2-year-old toddlers did 
not adapt at all. This could suggest that toddlers ignore their own auditory 
feedback to focus on external stimulation or have an immature feedfor- 
ward control. According to Messum and Howard (2012), this observation 
contradicts the widely held view that children learn speech sounds by imi- 
tation, which would require them to listen to what they produce and try to 
make it match what they want to imitate. Instead, it supports the idea that 
a child learns to speak thanks to a tutor: “Mothers reflect (or mirror) what 
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their children say, but such imitation generally takes the form of reformu- 
lation into well-formed sounds of the ambient language, rather than simple 
mimicry” (Messum & Howard, 2012, p. 160). Thus, plasticity observed in 
adults in the situation of adaptation to auditory perturbations may be dif- 
ferent in nature to what occurs in the early speech development. 


5.8. Surface effects and speakers’ characteristics 


Other effects related to speakers or context, like the characteristics of the 
prompt during the adaptation procedure, may influence speech adapta- 
tion. Alsius, Mitsuya, Latif, and Munhall (2017) investigated the influence 
of the stimulus used to prompt the training word “head” by contrasting 
visual and auditory modalities as well as linguistic vs non-linguistic 
prompts. No effect of the sensory modality was found on the magnitude 
of adaptation but linguistic prompts (“head” as a spoken or written word) 
were found to induce more adaptation than non-linguistic prompts (a 
cross or a tune). Similarly Sato and Shiller, (2018) found no difference 
in the magnitude of adaptation between visual and auditory modalities. 
In addition, Caudrelier et al. (2018) investigated whether naming a pic- 
ture or reading a word aloud would make a difference in adaptation and 
in transfer. Although no effect was found in the adaptation response, the 
pattern of generalization was influenced by the prompt used during the 
transfer phase, regardless of the training prompt, hinting at possible sur- 
face effects. 

With regards to speakers’ abilities, Martin et al. (2018) found no cor- 
relation between general executive control and adaptation magnitude. In 
a preliminary study, Dimov, Katseff, and Johnson, (2012) investigated the 
influence of speakers’ characteristics including some social and personal 
aspects. In particular, less empowered subjects were found to adapt more 
than more empowered ones. Finally, Munhall et al. (2009) reported equiv- 
alent adaptation in naive speakers and in speakers who were informed of 
the shift and who were asked to compensate or not. These results suggest 
that auditory-motor recalibration is at least in part an automatic process. 
More work is required to better understand the complexity of adaptive 
profiles that might be determined by numerous factors, as discussed in the 
next section. 
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6. Research outlook on formant perturbations 


In this section, we identify some perspectives for future studies in adap- 
tation to formant perturbations, in relation to methodological aspects as 
well as to some of the reviewed research questions. 


6.1. Toward standards to investigate and report 
adaptation to formant perturbations 


Various interests have motivated adaptation to formant perturbations 
studies in various teams. This induced the use of different methods to alter 
formants but also different procedures and analyses. These methodolog- 
ical differences often make studies difficult to compare directly. Therefore 
some standards should be developed, in particular to facilitate meta- 
analyses of formants perturbations studies, at least with regards to the way 
to report the methods and the results. Munhall, Purcell and collaborators 
studies are very interesting in this regard, as they have involved a signifi- 
cant number of speakers and have used similar methods to alter formants 
to run the adaptation and to analyze the data. A number of questions 
should be taken into consideration when designing and reporting studies. 
Some of them may also require further methodological studies, in line with 
Munhall and collaborators work. For instance: 


- Should the participants be only females or males? What is the effect of 
mixing vs. not mixing gender on adaptation? 

- This first question could be crossed with the effect of the type of pertur- 
bation: should the perturbation be absolute vs. relative, formant values 
being clearly different across gender? What is the effect of shifting only 
F1 vs. F1 and F2 in opposite direction? 

- Whether participants are monolingual or multilingual should be con- 
trolled and reported, and as far as possible kept available for meta- 
analysis. Indeed, adaptation seems influenced by perceptual categories, 
which are related to phonological systems of languages. One of the best 
ways to address the question would be to be able to compare large 
datasets recorded around the world in the different research topics. 

- What is the real effect of bone conduction on adaptation? This question 
has not been addressed systematically, although it has been considered 
in the conception of apparatuses to shift formants. Most studies used 
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quite high sound intensity of feedback and/or mixed the signal with 
noise. The effect of the feedback level, the signal to noise ratio as well as 
the type of noise on adaptation were not systematically reported. 

- What is the real effect of the perturbation on the signal heard by 
speakers? This question is rarely investigated in papers, while the 
obtained perturbation can be far from the expected one (Mitsuya et al., 
2015). In particular, when using existing packages such as Audapter, 
delay in feedback should be checked, as it could depend on the prop- 
erties of the OS and computer hardware. The evaluation of formants 
provided by the tool, especially when applying unusual shifts, should be 
verified as there is no guarantee that the system will be able to track and 
shift the formants in the expected way. This is true for all the systems 
and could be easily verified by comparing the obtained formant values 
with corresponding spectrograms or with values assessed by an inde- 
pendent formant assessment software. This approach was used in Reilly 
and Pettibone (2017). 

- Due to the high variability in adaptation magnitude between 
participants, apparent differences on some parameters of adaptation 
between conditions are often found to be non-significant. Some effects 
and, in particular, surface effects such as visual vs. audio prompts might 
exist but may require testing a large number of speakers to reach sig- 
nificance. This could also be the case for effects related to the direction 
of the perturbation or to the number of trials during the hold phase 
as well as to the way the perturbation is introduced. At the very least, 
non-significant results between different groups of speakers should be 
interpreted carefully, in relation to this large variability. 


These examples suggest that methodological aspects should be directly 
addressed and clearly reported to help teams working in the field share 
standards and enable the constitution of large databases. Large between- 
subjects variability suggests that adaptation to formant perturbations is 
a complex phenomenon, influenced by different factors. Multifactorial 
analyses such as introduced in Dimov et al. (2012) could be run on large 
datasets, but this requires — at the very least — recording of systematic 
information about the participants and reporting clear information about 
the perturbation and its real effect. 
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6.2. Topics which will benefit from further investigation 


Due to the broad range of research topics addressed by formant perturba- 
tion studies, more studies are still required to reproduce or better under- 
stand some results. This is particularly the case for the effect of adaptation 
on categorical perception, as results between studies have been sometimes 
inconsistent. Only a few studies were published on the effect of adapta- 
tion to formant perturbations on categorical perception of speech (Lametti 
et al., 2014; Schuerman et al., 2015; Schuerman et al., 2017; Schuerman, 
Nagarajan, et al., 2017) with some inconsistent findings between Lametti 
et al. (2014) and Schuerman et al. (2017). The two studies were run with 
speakers of different languages (English vs. Dutch) and with different types 
of continua for the perceptual test (words vs. vowels). It would be useful to 
gain more awareness of other attempts with non-significant or inconsistent 
profiles of perceptual changes following adaptation if any exist. This will 
avoid a publication bias towards significant-only results that seems to be 
a sensitive topic for this research question, in particular as the effects of 
speech production on changes of categorical boundaries may be sensitive 
to numerous variables, including the number of speakers, their gender, 
regional accent, languages skills etc. Replication is also required as the 
involvement of the motor system in perception is an important challenge 
for speech research more generally. 

Investigating the development of feedback and feedforward control sys- 
tems and their potential interaction in typically developing children is also 
an important topic to further develop using formant perturbations par- 
adigm. Moreover, using compensation to unpredictable perturbations in 
conjunction with sustained perturbations in atypical speakers may shed 
light on the root causes of some pathologies affecting speech production. 
For instance, van den Bunt et al. (2017) provides a rather convincing 
explanation about the sensorimotor bases of dyslexia, which could be fur- 
ther investigated in children. As adaptation has been shown to interact 
with phoneme categories, it allows investigating the development of pho- 
nological categories in both typical children and children with phonolog- 
ical disorders. 

An important topic also under-investigated so far is the influence of 
extraneous factors (i.e. not directly related to language or speech) on 


Changes in speech production in response to formant perturbations 59 


auditory-motor adaptation. First results by Munhall et al. (2009) suggested 
that the magnitude of the compensation is relatively independent of the 
awareness of the experimental aim and that speakers compensate even 
when asked not to compensate. This suggests that adaptation is quite inde- 
pendent from higher cognitive functions such as attention. Martin et al. 
(2018) also found no significant contribution of general executive con- 
trol skills on adaptation. However, the preliminary work by Dimov et al. 
(2012) suggests that variables related to speakers’ social status may play 
a role. Further investigations linking working memory abilities, attention 
levels etc. to formant adaptation will help tackle the mainstream issue of 
the link between cognitive and sensorimotor functions. This topic, as a 
number of others, has already been investigated in adaptation or com- 
pensation to FO perturbation (Guo et al., 2017; Hu et al., 2015; Scheerer, 
Tumber, & Jones, 2015). Last but not least, results by MacDonald et al. 
(2012) showing that toddlers do not adapt and the associated discussion 
of this result by Messum and Howard (2012) suggest that the communica- 
tive context may also influence adaptation. The question was investigated 
in birds by Sakata and Brainard (2009) suggesting larger adaptation when 
the song is produced in presence of another bird but also in humans with 
other type of perturbations such as speech in noise (Garnier et al., 2010). 
Social context might thus be relevant to question the real nature of speech 
targets. 

An important topic not developed in this chapter is a systematic analysis 
of the results of formant perturbation studies in relation to current models 
of speech production. A joint analysis with the results of other auditory 
and somatosensory perturbation studies could improve our understanding 
of feedback and feedforward controls. 

Finally, as it is relevant to the link between learning and memory, we 
would like to emphasize that transfer of adaptation was under-studied 
so far, despite its potential to bring insight into the nature of speech 
representations. As already introduced in Houde and Jordan (1998), 
transfer of learning is an empirical tool to question the nature of speech pro- 
duction units. This approach should be better connected to the equivalent 
approach developed for perceptual learning (e.g. Chambers et al., 2010). 
As noted by Cai et al. (2010) patterns of transfer question the way models 
of speech production represent sensorimotor mapping: both significant 
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generalization effect, as well as gradient effects should be explained. These 
models should also be adapted to integrate results from transfer or mul- 
tiple adaptation studies suggesting that the mapping between auditory 
and articulatory domains could occur at different linguistic levels and be 
related in some way to the training word. But more generally, adaptation 
might be related to the episode of learning, as also discussed in Houde 
and Jordan (2002) when explaining the long-term effects of adaptation by 
implicit memory. We strongly believe that understanding the link between 
sensorimotor learning and memory would be a fruitful path towards 
understanding of embodied cognition and the links between language and 
speech. In any case, identifying the condition of specificity vs. general- 
ization of adaptation will clearly contribute to the debate on the nature 
of speech production representation and to the debate on the nature of 
internal models and their relation to sensorimotor memories. 


7. Conclusion 


Twenty years ago, Houde and Jordan introduced formant perturbations 
in auditory feedback as a new paradigm to explore speech production. 
This seminal study is cited by papers in various domains: speech produc- 
tion and perception in general, studies using other kind of perturbations 
related to speech (e.g. pitch alteration, vocal tract perturbation), motor 
control as well as vocalizations in animals. Moreover, it has inspired a 
whole research field which is still in expansion. In this review, we scanned 
all studies citing Houde and Jordan (1998, 2002) and selected 77 articles 
focused on formant perturbations. The perturbation systems designed for 
this purpose are reported and described in the review. The main research 
topics addressed in these studies are also explained, along with their main 
findings. 

The formant perturbation paradigm proved to be insightful in exploring 
the relationship between speech production and perception. First, the 
observation of responses to auditory perturbations has shed light on the 
role of auditory feedback in speech production, and the mechanisms that 
control it. Experimental findings have been incorporated in speech pro- 
duction models, although some results still need to be modeled. Altering 
both auditory and somatosensory feedback showed that both modalities 
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are integrated in the control of speech in a manner that may be specific 
to the speaker and/or to the task (e.g. the vowel to produce). Associating 
perceptual categorization tasks and training with formant perturbations 
revealed a close relationship and mutual influence between speech motor 
control and phonological categories, mediated by categorical perception. 

This relationship between motor control and linguistic units (e.g. 
phonemes or syllables) has also been explored by observing the general- 
ization of auditory-motor adaptation. Generalization, or transfer, has been 
observed from a vowel to the same vowel in different words, suggesting 
the existence of an underlying phoneme representation. While transfer 
may occur from one vowel to another, supporting the idea of broad gen- 
eralization in speech learning, the magnitude of transfer seems to depend 
on some similarity relationship between the training and the transfer 
utterances. Moreover, simultaneous adaptation to opposite perturbations 
has been observed in two different vowels and even in the same vowel in 
different words. This apparent contradiction may represent a challenge for 
speech production models, as it requires much flexibility in the translation 
of auditory goals into articulatory gestures, and questions the nature of 
mental representations interfacing with speech articulation. 

Studies in cognitive neurosciences have pinpointed neural correlates of 
sensory integration and motor control in speech production, in terms of 
brain regions as well as communication networks and frequency bands. 
While studying patients with cerebellar degeneration also contributes 
to this purpose, research in other pathologies, including stuttering, 
Parkinson’s disease, dyslexia, developmental speech disorders, and some 
autism subtypes, have benefited from formant perturbation experiments in 
understanding of the main causes and mechanisms underlying these spe- 
cific disorders. Finally, studying compensation and adaptation in children 
gives insights in the development of sensorimotor processes at stake in 
speech production. Effects of communicative situation or social context 
may also be explored, as it has proven influential in some speech motor 
control characteristics in adults. Further investigations in children in var- 
ious communicative contexts could eventually shed light on one of the 
most intriguing questions in our research field: how does a child learn 
to speak? 
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Beyond these core topics associated with Houde and Jordan’s paradigm, 
other questions have emerged in relation to speakers’ cognitive functions 
and social characteristics, as well as learning context. The prompt has 
been suggested to influence adaptation and transfer pattern. Moreover, 
Houde and Jordan had already noticed that adaptation was still there when 
speakers were tested one month later with normal feedback. This obser- 
vation may suggest that learning is to some extent specific to the context 
in which it occurs, the testing room for instance. This is consistent with 
multiple-trace memory models or exemplar-based views (Goldinger, 1998; 
Hintzman, 1986), according to which each event is recorded in the brain 
in the form of a trace combining multiples elements from sensory inputs. 
Being confronted with one of these elements may activate all the traces 
containing it, and therefore the other elements associated with it. Thus, 
the specific context of the testing room may reactivate the adaptation that 
had washed out in other contexts. Investigating retention of adaptation 
in various time ranges and contexts may pave the way to fruitful research 
exploring the relationship between speech, learning and memory. 
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Spatial and temporal variability of corrective 
speech movements as revealed by vowel 
formants during sensorimotor learning 


Abstract: Previous perturbation studies demonstrate that speakers can reorga- 
nize their motor strategies to adapt for articulatory or auditory perturbations 
(Savariaux, Perrier & Orliaguet, 1995; Rochet-Capellan & Ostry, 2011). However, 
across most studies we observe a fluctuating amount of inter-individual differences 
with respect to the adaptation outcome. To evaluate the predictions of the hypoth- 
eses put forward to explain these differences, we conducted a multidirectional 
auditory perturbation study investigating F2 perturbation with native Russian 
speakers. During participants’ production of CV syllables containing the close 
central unrounded vowel /i/, F2 was perturbed in opposing directions depending 
on the preceding consonant (/d/ or /g/). The bidirectional shift was intended to 
encourage participants to produce the vowel /i/ with two different motor strategies 
and allowed us to investigate intra-individual variation of adaptation patterns as 
a function of the perturbation direction and the consonantal context. To examine 
the evolution of the adaptation process, we performed generalized additive mixed 
modelling (GAMM) on the averaged and individual formant data using the 
experimental trials as discrete time points. In doing so, we were able to examine 
sudden changes in participants’ adaptation strategies, which appeared as non- 
linearities in the F2 curve. Our results suggest that previously formulated hypoth- 
eses regarding individual adaptation processes make empirical predictions which 
are not confirmed by the bidirectional perturbation data. Therefore, we propose a 
more general hypothesis that the successful adaptation is dependent on speakers’ 
ability to coordinate the perceived auditory errors with appropriate compensa- 
tory movements, which is influenced in turn by the complexity of the adaptation 
task. We discuss this hypothesis in the context of individual adaptation patterns 
and show that it not only can explain the inter-individual, but also the inter-study 
variability observed in previous perturbation studies. 


Keywords: auditory feedback, real-time perturbations, formants, variability, indi- 
vidual behavior, generalized additive mixed modelling, Russian 
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1. Introduction 
1.1. Perturbation and sensorimotor learning 


Picture the situation of taking a photo of beautiful lakeside scenery and 
accidently dropping your camera into the water. Despite your misfortune, 
you are lucky and can spot the camera within what appears to be a reach- 
able distance at the lake bottom. Hastily, you try to retrieve the camera 
but grab a few times beside it before you can actually take hold of it. 
Or even worse, you realize that the bottom that appeared reachable lies 
in fact much deeper below the water surface. In the described example, 
the coordination between your visual input and your hand movements is 
disrupted by the visual distortions caused by the different reflective angles 
between the water and the air. The fact that you can eventually grab the 
camera after a few attempts, assuming the lake bottom is indeed reachable 
by hand, provides evidence for the flexibility of the human sensorimotor 
system which is able to adapt for the visual perturbations and to find alter- 
native motor strategies to reach the intended goal. 

The same is mostly true for mechanical and auditory perturbations of 
speech. That is, when you later recall you tale of bad luck to a friend 
during the conference dinner, and you get upset about the unreasonable 
repair costs of your camera, you might speak with a mouth full of food. 
In this case, your articulators’ movements might be impeded by pieces of 
food which will force you to find alternative strategies to intelligibly artic- 
ulate the words you intend to utter. Or, in another scenario, you may have 
to increase the loudness of your voice to compensate for the loud conver- 
sation happening at the table next to yours. 

During experiments applying controlled perturbation, speakers have 
to produce speech under aggravated conditions, e.g., under blockage of 
their jaw movements or under altered auditory feedback. As in the ini- 
tial example with the hand-eye coordination, speakers need to coordinate 
errors transmitted by their sensory input with appropriate corrective artic- 
ulator movements to be able to retain intelligibility of their speech. In the 
case of speech, it is particularly intriguing which sensory channels (e.g., 
somatosensory, proprioceptive, or auditory) are involved in the process 
of adaptation. The answer to this question may provide a better under- 
standing of the different types of sensory information relevant for speech 
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production and ultimately the goals of articulator movements. Thus, the 
study of perturbed speech provides an empirical means to study the nature 
of speech sound representations as well as learning processes that occur in 
speech production. 


1.2. Outcome variability in perturbation studies of speech 


Despite the general ability of speakers to reorganize their motor strate- 
gies to retain the acoustic make-up of the intended speech sounds under 
aggravated conditions, the outcome of adaptation processes in speech 
exhibits high inter-individual and inter-study variability. For instance, 
Gay, Lindblom and Lubker (1981) examined participants’ productions of 
vowels when a bite block was inserted between their teeth. The authors 
found that speakers were able to adapt to these static perturbations with 
very little or no practice and produce acoustic outputs equivalent to 
their unperturbed speech. However, in a study by Savariaux, Perrier and 
Orliaguet (1995) when speakers’ lips were blocked with a tube during 
the production of the French [u] only six out of 11 speakers were able 
to partially compensate for the labial perturbation and only one speaker 
compensated completely by changing the constriction location from a 
velo-palatal to a velo-pharyngeal region. The remaining four speakers did 
not compensate at all. Similar variability of the experimental outcomes is 
also observed across other articulatory perturbation studies, e.g., by Baum 
& McFarland (1997), Jones & Munhall (2003), and Brunner, Hoole & 
Perrier (2011). To explain this variability, Savariaux et al. (1995) sug- 
gest that the varying degree of adaptation among participants is due 
to “speaker-specific internal representation of articulatory-to-acoustic 
relationships”. 

More recently, it has become possible to study speakers’ articulatory-to- 
acoustic relations by means of real-time perturbation of speakers’ auditory 
feedback. This methodology allows alteration of such acoustic parameters 
as fundamental frequency (f0; Jones & Munhall, 2000) and vowel 
formants (F1 and/or F2; Houde & Jordan, 1998; Purcell & Munhall 2006; 
Villacorta, Perkell & Guenther, 2007), and has the advantage that mul- 
tiple perturbation conditions can be tested within the same study without 
participants’ awareness of any systematic manipulations. For instance, 
Rochet-Capellan and Ostry (2011) perturbed the first formant (F1) in the 
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vowel /e/ in opposing directions depending on the experimental stimulus in 
which it was embedded (head or bed), while in a control stimulus (ted) the 
F1 remained unchanged throughout the experiment. The authors found 
that speakers were overall able to adapt for the three distinct F1 levels 
which means that during the study participants employed three different 
motor strategies to produce the vowel /e/. However, as with articulatory 
perturbation studies mentioned above, there is a noteworthy proportion of 
speakers, ranging from 10 to 20 % per study, who fail to adapt to auditory 
perturbations. Roughly speaking, these speakers exhibit two qualitatively 
different types of adaptation behaviors: either adjusting their response in 
the same direction as the applied perturbation, or hardly reacting to it. 

One of the more recent hypotheses put forward to explain the outcome 
variability observed in perturbation studies is the idea by Lametti, Nasir 
and Ostry (2012) that speakers have individual preferences for articula- 
tory or auditory feedback to control their speech production. To empiri- 
cally evaluate their claim, Lametti et al. (2012) investigated participants 
in different experimental conditions where the authors either perturbed 
participants’ jaw trajectories without altering their speech acoustics, or 
perturbed their auditory feedback, or applied both types of perturba- 
tion simultaneously. The authors found a negative correlation between 
the amount of articulatory and auditory adaptation which means that 
speakers who adapted to articulatory perturbations, adapted to auditory 
alternations to a lesser degree. 

However, Lametti et al.’s (2012) hypothesis conflicts with observations 
previously made by Ghosh et al. (2010) who investigated the relation 
between somatosensory and auditory acuity, where acuity stands for the 
degree to which speakers were sensitive to changes in articulatory and 
auditory feedback signals. Running contrary to the idea that speakers 
exhibit individual preferences towards auditory or somatosensory feed- 
back, Ghosh et al. (2010) found that both types of acuity positively corre- 
lated with each other as well as with the magnitude of produced sibilant 
contrasts. In the context of vowels, the latter finding was previously made 
by Perkell et al. (2004). Furthermore, auditory acuity has been shown to 
have an influence on the adaptation magnitude during auditory perturba- 
tion of vowel formants (Villacorta et al., 2007) as well as during articu- 
latory perturbation of sibilants (Brunner, Ghosh, Hoole, Matthies, Tiede 
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& Perkell, 2011). In contrast to Lametti et al.’s (2012) hypothesis which 
predicts that speakers who fail to adapt to auditory perturbations should 
virtually ignore them, individual differences in auditory acuity provide a 
way to explain partial compensations which are frequently observed in 
auditory perturbation studies. 

Another explanation for partial compensations was provided by Katseff, 
Houde & Johnson (2012) who suggest that these are the result of speakers’ 
attempts to integrate the altered auditory signal with the normal somato- 
sensory signal that speakers receive during a perturbation experiment. 
Similar to other authors (e.g., Sato, Schwartz & Perrier, 2014), Katseff 
et al. (2012) assume that vowel targets are defined as regions in a mul- 
tidimensional acoustic-somatosensory space. That is, when during audi- 
tory perturbation the acoustic parameters of speakers’ speech are diverted 
from the target, speakers will compensate for the acoustic error. However, 
their compensation will stop when the discrepancy between the auditory 
and somatosensory signals becomes too large. Katseff et al. (2012) sup- 
port their view by the observation that in their study of F1 perturbation 
the relative compensation magnitude decreased from 100 % for 50 Hz 
perturbations to 40 % for 250 Hz perturbations. An analogous finding 
was previously made by MacDonald, Goldberg & Munhall (2010) for F1 
and F2 perturbation. 

At this point, we would like to add that it is alternatively possible that 
it is not the discrepancy between the altered acoustic and somatosensory 
signals that is causing the incomplete compensation, but rather phys- 
ical restrictions which do not allow participants to compensate beyond 
a certain physical limit. For instance, it seems plausible that large F1 
perturbations could require speakers to push their tongue beyond phys- 
ical limits imposed by the palate, the upper incisors, or other parts of the 
vocal tract. 


1.3. The role of the adaptation task complexity 


Although the hypotheses reviewed above are based on different premises, 
they mostly ascribe the source of the inter-individual outcome variability 
to the mechanisms of speakers’ internal models of speech motor control. 
This approach leads to a situation in which each of the proposed hypoth- 
eses offers a potential explanation for the inter-individual adaptation 
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variability in the context of a specific perturbation task; however, none 
of them does actually provide a general account of the variability that is 
observed across different experimental tasks or conditions. In this specific 
situation, it appears necessary to investigate the question of whether the 
complexity of the adaptation task might have an impact on its outcome. 
Let us illustrate this point with an example. 

It is plausible to assume that the adaptation to bite-block perturbation 
during production of vowels (e.g., Gay et al., 1981) requires an articulatory 
adjustment that is more similar to the unperturbed condition compared to 
the case of lip-tube perturbation during the production of /u/ (Savariaux 
et al., 1995). During the first task, participants are merely required to 
lift their tongue more strongly than usual since their jaw, which normally 
assists at this task, is blocked. Furthermore, the direction of the compen- 
satory tongue movement does not change due to the perturbation. During 
the lip-tube perturbation, on the other hand, participants have to compen- 
sate for blocked lip rounding by retracting their tongue. This articulatory 
adjustment is less obvious as the articulator used to compensate for the 
perturbation and its movement direction are less associated with the usual 
articulatory configuration used to produce the intended sound. As a con- 
sequence, the adaptation process may take longer and fewer speakers are 
able to identify the appropriate articulatory adjustments to compensate 
for the perturbation. Therefore, in our current study we will also investi- 
gate the question of whether the outcome variability can be explained by 
speakers’ inability to coordinate the perceived auditory error with appro- 
priate corrective articulatory movements. 


1.4. Current study 


To investigate in more detail how speakers translate the altered auditory 
signal into corrective articulatory movements, we conducted a bidirec- 
tional auditory perturbation study with native Russian speakers. Unlike 
Rochet-Capellan & Ostry (2011), who investigated speakers’ adapta- 
tion to multiple F1 degrees, we focused our investigation on F2, which 
is, roughly speaking, an indicator of horizontal tongue displacement. In 
our experiment, participants had to produce the close central unrounded 
vowel /i/ embedded in CV syllables /di/ and /gi/. Depending on the pre- 
ceding consonant, F2 in /į/ was perturbed in opposing directions. 
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The bidirectional perturbation imposed higher adaptation demands on 
our participants since they had to coordinate their corrective movements 
in two different ways depending on the perturbation direction. Based on 
the hypothesis that higher task complexity influences the adaptation pro- 
cess, we expected to observe a high amount of exploratory corrective 
movements and possibly also spontaneous behavior changes in the course 
of the experiment. 

The combination between the place of articulation (alveolar vs. velar) 
and the perturbation direction (down vs. up) was counterbalanced 
between all participants which allowed us to control for the potential 
influence of articulatory restrictions associated with each syllable on the 
compensation. To investigate how quickly participants can adapt to ab- 
rupt and substantial magnitude changes in perturbation, we increased the 
perturbation amount in 150 Hz steps across three perturbation phases and 
excluded ramp trials (gradual changes of perturbation magnitude) from 
the experiment. 

Finally, to understand the spatial and temporal evolution of the adapta- 
tion process, we analyzed the formant data with generalized additive mixed 
models (GAMMs) which allowed us to observe non-linear changes in 
participants’ responses to perturbation. By doing this, we seek to overcome 
the shortcomings of previous perturbation studies which concentrate on the 
comparison of speakers’ performance between the beginning and the end of 
the adaptation task, i.e., in most extreme cases during the first and the last 
trial of the experimental session or more often during the first 15-20 and 
the last 15-20 trials of the experiment. Unfortunately, this aggregation ap- 
proach allows only for pairwise time-uncorrelated comparisons (e.g., Feng, 
Gracco & Max, 2011; Trudeau-Fisette, Tiede, & Ménard, 2017) while 
the evolution of the adaptation process is often presented only in explor- 
atory scatterplots in earlier studies (e.g., Rochet-Capellan & Ostry, 2011; 
Lametti et al., 2012; Mitsuya, Munhall & Purcell, 2017). 


2. Methods 
2.1. Participants 


18 native speakers of Russian (14 female and 4 male) without reported 
speech, language, or hearing disorders participated in the experiment. 
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All participants were recruited in Berlin. The mean age of the group was 
25.8 years (range 20-37). Participants had spent on average three years 
in Germany prior to the recordings. The study was approved by the local 
ethics committee and all speakers gave their written consent to participate 
in the study. 


2.2. Equipment 


For each experimental session, participants were seated in front of a 
19-inch monitor inside a sound attenuated booth. The monitor served to 
display the stimuli and experimental instructions which were presented in 
Russian. Participants’ speech was recorded with a Beyerdynamic Opus- 
54 neck-worn microphone and fed back via foam tipped E-A-RTONE 
3A earphones (Figure 1). The distance between participants’ mouths 
and the microphone was about 3-5 cm. The earphones attenuated the 
air-conducted sound by 25-30 dB while the feedback level was ampli- 
fied relative to the microphone gain to weaken potential effects of air and 
bone conduction. The feedback volume was fixed across all participants. 
However, it was not possible to quantify the feedback level in a precise 
and meaningful manner since actual feedback volume is expected to vary 
slightly due to such parameters as the length and the size of participants’ 
ear channels. Real-time tracking and formant perturbation were performed 
with AUDAPTER, which is a C++ audio signal processing application exe- 
cutable within a MATLAB environment (cf. for technical details Cai et al., 
2008). The delay of the feedback loop was approximately 14ms. The 
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Figure 1: (A) Scheme of the experimental set-up. (B) Foam tipped insert earphones. 
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original and perturbed audio signals were digitized and saved with a sam- 
pling rate of 16 kHz. AUDAPTER also stored data files which contained 
the formant values (F1, F2, and F3) tracked on each trial. 


2.3. Speech stimuli and experimental protocol 


For our study we chose Russian since its vowel inventory includes the close 
central vowel /i/ which is flanked within the F2 space on each side by the 
two phonemes /i/ and /u/. This constellation allowed us to investigate mul- 
tiple adaptation in // with bidirectional perturbation of the F2 frequency. 
The vowel /i/ has a special status in the Russian vowel system since it never 
appears in word initial position or after palatalized consonants (cf. Bolla, 
1981, p. 66). 

Each recording session lasted approximately 20-25 minutes and 
consisted of four experimental phases. Before the start of the first experi- 
mental phase, participants completed a few practice trials with unrelated 
speech material to assure they understood the task and were able to per- 
form it accurately. During a baseline phase, which lasted for 60 trials, no 
auditory perturbation was applied and participants were able to familiarize 
themselves with the experimental situation of receiving auditory feedback 
over earphones. On each trial, which had an approximate duration of 2 
seconds, participants were visually prompted to produce one of the four 
CV syllables /di/, /di/, /gi/, and /gu/. This was done to assess participants’ 
initial F1-F2 formant space. The inter-stimulus interval between the trials 
was approximately 1.5 seconds. The visual presentation of the stimuli was 
controlled by a customized MATLAB software package developed at the 
Institute of Phonetics and Speech Processing, LMU Munich. 

During the three following perturbation phases, each of which lasted for 
50 trials, participants produced CV syllables containing the close central un- 
rounded vowel /i/ embedded in the context of alveolar and velar consonants 
/d/ or /g/. Depending on the consonantal context, the F2 was perturbed either 
downwards or upwards on each trial of each perturbation phase. Within 
each perturbation phase all stimuli were presented in pseudorandom order. 
This means that a participant could experience one perturbation direction 
on one trial and the other direction on the immediately following one; also, 
the same perturbation direction was never applied on more than two con- 
secutive trials. The interaction between the place of articulation (alveolar vs. 
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velar) and the perturbation direction (downward vs. upward) was evenly 
counterbalanced between the 18 participants resulting in two experimental 
groups (A and B). The perturbation magnitude amounted to 220 Hz during 
the first perturbation phase and increased in each perturbation phase by 150 
Hz. Consequently, the perturbation magnitude was 370 Hz for the second 
perturbation phase and reached 520 Hz in the last phase of the experiment. 
The amount of perturbation did not change within each shift phase. There 
were no ramp trials between the perturbation phases. 

Participants were naive to the purpose of the experiment and were 
instructed to produce all syllables with prolonged vowels. The prolonga- 
tion of the vowels maximized the amount of time during which participants 
were exposed to perturbed vowels. To keep the prolongation duration 
consistent across participants, they were assisted by a visual go-and-stop 
signal during their production. The go-and-stop signal had the form of a 
frame. Between the trials, while the frame stayed red, the response syllable 
of the upcoming trial appeared on the display and stayed within the frame. 
When a trial started, the frame color turned green which gave participants 
the signal to begin with their response. 

Following the experimental session, all participants were asked if they 
noticed anything unusual in their auditory feedback during the experiment. 
A few of the participants reported that their pronunciation was different 
from what they are used to or that they perceived an acoustic difference 
between the syllables /di/ and /gi/. Most participants attributed these pro- 
nunciation differences to the effect of listening to their own speech on 
audio recordings, so when asked if and how these differences affected their 
production, participants reported to have ignored these. From previous 
research, however, it is known that participants are not able to voluntarily 
control their reaction to auditory perturbation even if they are told to 
ignore it (cf. Munhall, MacDonald, Byrne & Johnsrude, 2009). 

All recordings of 18 participants amounted to 3780 trials. The onset 
and offset of the vowel segment produced on each trial were labeled man- 
ually in MATLAB using its graphical input facilities. Subsequently, the 
formant trajectories were extracted from AUDAPTER’s data files based on 
the labeled onset and offset boundaries. A window with a length of 50 % 
of each formant trajectory centered at its midpoint was used to compute 
the formant means produced on each trial. 
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2.4. Data analysis 


All analyses were performed in R (version 3.4.1; R Core Team, 2017). 
During the data analysis, we first examined the general adaptation pattern 
that occurred over the course of the experiment in the syllables containing 
the central vowel /i/. Next, we looked at individual spatial and temporal 
changes of vowel formants due to the applied perturbation. Finally, by 
investigating participants’ initial F1-F2 vowel space, we evaluated the 
potential influence of the surrounding sound categories /i/ and /u/ on the 
individual compensation strategies. 

To examine average formant changes in participants’ production of the 
two syllables /di/ and /gi/ across the four experimental phases, we fitted a 
generalized additive model (GAM; Hastie & Tibshirani, 1987). A GAM 
is a significant extension of a generalized linear regression model which 
allows the modelling of non-linear relationships between the depen- 
dent and independent variables (Wood, 2017a). Therefore, GAMs are 
much more flexible compared to linear regression models. The non-linear 
relationships are modelled via complex functions (smooths) which are 
constructed from ten basis functions (e.g., linear, quadratic, and cubic 
functions) with an adjustable number of basis dimensions. The number of 
basis dimensions is a number which indicates the upper limit of how com- 
plex the constructed function can be and is estimated directly from the 
data during the modelling process. That means that the usage of GAMs 
does not require from the researcher a predefined specification of a certain 
(non-linear) function as it is derived directly from the data. To prevent 
overfitting of the data, i.e., modelling of functions which are too com- 
plex and therefore might obscure any generalizable patterns in the data, 
GAMs are estimated using penalized likelihood estimation and cross- 
validation (cf. for details Wood, 2006). In the case of cross-validation, 
several subsets of the complete data sample are created always excluding 
a single data point and the model is refitted to all of these subsets exam- 
ining how well it predicts the excluded data. One further advantage of 
GAMs is the possibility to include random effects into the model struc- 
ture to account for individual response variability across but also within 
speakers (cf. Baayen, Vasishth, Kliegl & Bates, 2017). To denote the 
inclusion of random effects in the fitted model, it is dubbed generalized 
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additive mixed model (GAMM). For a hands-on introduction to GAMMs 
with a focus on dynamic speech analysis see S6skuthy (2017). 

The GAMM offers three main advantages for analyzing the data from 
the current experiment. First, it is possible to analyze the data as a function 
of time which allows us to investigate the whole adaptation process rather 
than just its outcome. Secondly, the non-linearity of parameter smooths 
does not make any assumptions regarding the temporal or spatial charac- 
teristics of the adaptation process. Finally, the parameter smooths can be 
estimated including random effects which allows us to capture individual 
variability of the adaptation process. 

Prior to building the GAMM model, participants’ raw formant fre- 
quencies were normalized by subtracting each participants’ mean formant 
frequency produced during the baseline phase for the respective syllable (/di/ 
or /gi/). This was done to exclude participant-specific differences regarding 
their absolute formant magnitudes (e.g., due to gender differences). By 
means of this normalization, the average F1 and F2 values for /di/ and /gi/ 
were set at zero for the baseline phase. 

Subsequently, using the mgcv package (Wood, 2017b) we fitted one 
GAMM model for each formant (F1 and F2) with normalized frequencies 
averaged across all participants and all experimental trials as dependent 
variable. The data of the unperturbed syllables /di/ and /gu/, which were 
uttered by participants only during the baseline phase, were not included 
in the resulting GAMMs. All GAMM models were evaluated, interpreted, 
and visualized by means of the itsadug package by van Rij, Wieling, 
Baayen & van Rijn (2017). 

In the model structure, we included random factor smooths with an 
intercept split for the perturbation direction (upward vs. downward) in 
order to assess (potentially non-linear) individual compensation mag- 
nitude differences over the course of the experiment. The model also 
included a fixed effect which assessed the ‘constant’ effect of the pertur- 
bation direction independently from the temporal variation. The resulting 
models explained 46.6 % and 66.9 % of the variance in the F1 and F2 
data, respectively. In comparison, the model which did not include the 
random smooths (participant-specific temporal variation) but only random 
intercepts and random slopes explained only 31.2 % of the variance in 
the F2 data. Maybe somewhat surprisingly, the inclusion of the phase 
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number (shift 1, shift 2, and shift 3) as an interaction with the perturba- 
tion direction did not significantly improve the model fit. We also refitted 
the F2 model including an interaction between the perturbation direction 
(upward vs. downward) and the experimental group (A vs. B) which also 
did not improve the fit. In both cases, the goodness of fit was assessed by 
the Akaike Information Criterion (AIC; Akaike 1974). 

Following the suggestion in Baayen, van Rij, de Cat & Wood (2016), 
the fitted models were investigated for the presence of autocorrelation in 
their residuals. Autocorrelation in the present study represents the correla- 
tion between the formant frequencies produced by one participant on two 
consecutive experimental trials. The higher the autocorrelation value is the 
less amount of information is contributed for the statistical model by each 
additional experimental trial. Ignoring this issue might result in overcon- 
fident estimates of the standard errors, confidence intervals, and p-values. 
The amount of autocorrelation at lag 1 was relatively moderate in the 
present data with 0.2 for F1 and 0.17 for F2. The effect of autocorrela- 
tion was practically reduced to zero by incorporating AR(1) error models 
in the specification of the fitted GAMM models. The corrected models 
explained 23.1 % and 63.4 % of the variance in the F1 and F2 data, 
respectively. The dropped percentages of the explained variance are due to 
the refitted models taking into account the autocorrelation which makes 
their prediction about actual frequency values worse. This is especially 
true for the F1 model which is an indication that much of the variance in 
the initial model can be explained by autocorrelated errors rather than by 
the specified model parameters such as the direction of the applied pertur- 
bation. Visual model inspection revealed that the residuals of the adjusted 
GAMMs followed a normal distribution for F1 and F2 data. 

To examine individual spatial and temporal differences of the adapta- 
tion process, we extracted F2 curves estimated for each participant by the 
GAMM model described in the above paragraphs. 

In order to evaluate whether the occurrence of certain individual com- 
pensation patterns was induced by sound categories surrounding the 
perturbed vowel, we investigated participants’ F1-F2 space using their 
baseline phase production. For this purpose, we fitted two linear-mixed 
models using the lme4 package (Bates, Machler, Bolker, and Walker, 2015). 
One model was fitted for each of the two average formant frequencies (F1 
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and F2) that were produced by participants in the syllables /di/, /di/, /gi/, 
and /gu/ during the baseline phase. The model structure included the pro- 
duced syllable and the interaction between the syllable and gender as fixed 
effects and the formant frequency as dependent variable. Furthermore, 
both models included an interaction between the syllable and the compen- 
satory pattern observed for each participant (cf. section 3.2 for a detailed 
discussion of individual compensation patterns). Random intercepts were 
modeled for each participant as well as random slopes for each produced 
syllable. 

Visual model inspection revealed that the residuals of the chosen models 
followed a normal distribution for F1 and F2 data. P-values were obtained 
with the ImerTest package by Kuznetsova, Brockhoff, and Bojesen- 
Christensen (2016). 


3. Results 
3.1. Overall compensatory behavior 


The GAMM estimated for F1 suggested that the applied perturbation did 
not have a ‘constant’ effect on the produced F1 values since its average did 
not significantly differ from the baseline on trials with upward (2.97 Hz, 
t=1.09, p > .05) as well as on trials with downward perturbation (-1.14 
Hz, t=-0.32, p > .05). These values represent ‘constant’ F1 differences for 
the whole experiment since they do not take into account any changes 
that appeared over time. Taking the temporal variation over the course 
of the experiment into account, the model did not reveal a F1 difference 
from the baseline for either of the two perturbation directions (Figure 2A). 
Furthermore, a direct comparison between trials with applied upward and 
downward perturbation revealed no significant difference in their F1 curves 
(Figure 2B). The average F1 difference amounted to 0.96 Hz (95 % CI 
[-6.03 7.94]) by the end of the first shift phase, 1.90 Hz (95 % CI [-6.14 
9.95]) by the end of the second shift phase, and 2.93 Hz (95 % CI [-7.86 
13.72]) by the end of the experiment. Random non-linear smooths of the F1 
model suggest that there were unsystematic participant-specific F1 changes 
which are most likely not related to the applied perturbation (Figure 2C). 
The absence of systematic compensatory effects in F1 is expected as no 
F1 perturbation was applied during the experiment. This outcome provides 
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Figure 2: Visual summary of the fitted GAMM model for F1: (A) Average 
compensatory effects (excluding random participant effects) in F1 for downward 
and upward perturbation over the course of the experiment. Grey bands 
represent 95 % confidence intervals. (B) The average difference in F1 between 
trials produced under opposing perturbation directions over the course of the 
experiment. Grey bands represent 95 % confidence intervals. (C) Random 
smooths estimated for each participant for her/his average F1 curve split by the 
perturbation direction. 


additional support for the validity of the applied experimental manipula- 
tion and the assumption that any systematic effects found for F2 are due 
to the application of the bidirectional perturbation. Due to the absence of 
compensatory effects in F1, we will not discuss this variable any further. 
The GAMM estimated for F2 suggested that the applied perturba- 
tion had a ‘constant’ effect on the produced F2 values on trials with 
upward (-127.58 Hz, t=-5.76, p < .05) as well as on trials with down- 
ward perturbation (143.14 Hz, t=5.36, p < .05). The direction of the 
‘constant’ effect was opposed to the direction of the applied pertur- 
bation during upward and downward perturbation. Examining the 
effect of the perturbation over time, the model revealed that this effect 
increased for both directions (upward and downward) over the course 
of the experiment (Figure 3A). On average, however, the effect appears 
to be stronger for the upward perturbation compared to the downward 
perturbation. The F2 difference between trials produced under opposite 
perturbation directions became significant after the baseline phase and 
increased, as expected, over the three perturbation phases (Figure 3B). 
The average F2 difference amounted to -131.51 Hz (95 % CI [-183.5 
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-79.52]) by the end of the first shift phase and to -193.61 Hz (95 % CI 
[-254.79 -132.44]) by the end of the second shift phase. By the end of 
the experiment, the average F2 difference reached -261.13 Hz (95 % CI 
[-345.41 -176.84]). The model suggested that the average compensatory 
effect in F2 can be modeled by linear functions for both perturbation 
directions as the estimated degrees of freedom (EDF) for both smooth 
terms amounted to 1. On the other hand, the random smooths fitted 
for individual participants exhibited a high degree of non-linearity for 
upward (EDF= 116.66) and downward (EDF = 97.76) perturbation 
directions. 

The random F2 smooths fitted individually for each participant dem- 
onstrate that above and beyond the general tendency to counteract the 
applied perturbation, participants’ adaptation patterns exhibited high var- 
iability in both investigated dimensions (formant frequency and time). For 
instance, the individual smooths refined the general observation that the 
downward perturbation caused on average weaker compensatory effect 
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Figure 3: Visual summary of the fitted GAMM model for F2: (A) Average 
compensatory effects (excluding random participant effects) in F2 for downward 
and upward perturbation over the course of the experiment. Grey bands 
represent 95 % confidence intervals. (B) The average difference in F2 between 
trials produced under opposing perturbation over the course of the experiment. 
The solid thick line denotes the region where the F2 difference was significant. 
Grey bands represent 95 % confidence intervals. (C) Random smooths estimated 
for each participant for her/his average F2 curve split by the perturbation 
direction. 
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over time. In Figure 3C, it is apparent that for most participants the solid 
lines (F2 curves produced under downward perturbation) remained closer 
to the baseline compared to the dashed lines (F2 curves produced under 
upward perturbation). 

To understand these participant-specific differences, we will examine 
and discuss individual adaptation patterns in more detail in the next 
section. 


3.2. Individual compensatory patterns 


As revealed by the individual F2 curves estimated by the GAMM model, 
the most distinct characteristic among participants was the magnitude of 
their compensation for the downward perturbation. Based on this metric, 
we identified five participants who throughout the experiment were com- 
pensating for the downward perturbation to the same extent as for the 
upward perturbation and 10 participants who compensated less (if at 
all) for the downward perturbation compared to the upward perturba- 
tion. In Figure 4, the first group (‘symmetrical’ compensation pattern) is 
represented by participants 3, 9, and 13, while participants 4, 16, and 17 
can be considered to represent the second group (‘asymmetrical’ compen- 
sation pattern). 

Examining subfigures for participants 3, 9, and 13, we see that the 
F2 curves for the two syllables /di/ and /gi/ diverged by equal amounts 
from the baseline as the experiment progressed. For participants 4 and 
16, on the other hand, the F2 curve produced under the upward pertur- 
bation diverged more strongly from the baseline. In contrast, the F2 curve 
produced under the downward perturbation appears to have fluctuated 
around the baseline. For participant 17, the effect of the perturbation 
direction appears to be flipped with stronger compensation for the down- 
ward perturbation. 

In addition to the symmetrical and asymmetrical compensation patterns, 
we identified in the sample three participants who were not able to con- 
sistently compensate for the opposite perturbation directions throughout 
the experiment (see participants 5, 6, and 7 in Figure 4). In summary, 
all 18 participants who participated in the study exhibited one of the 
three described adaptation behaviors. Representative data of only nine 
participants is depicted in Figure 4 due to space limitations. 
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As revealed by Figure 4, individual adaptation patterns exhibited a 
lot of spatial and temporal non-linearities. This fact makes it prohibi- 
tive to apply plain pairwise comparisons between participants’ produc- 
tion during the baseline and the last perturbation phase to assess whether 
speakers have successfully adapted to the perturbation. In the worst case, 
this approach risks obfuscating the specific characteristics of the adap- 
tation patterns. Just on grounds of such comparison, participants 5, 6, 
and 7 would qualify as speakers who failed to compensate in opposite 
directions. However, examining the evolution of their F2 responses over 
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Figure 4: Individual compensatory effects in F2 for downward and upward 
perturbation across all experimental trials. The F2 curves were estimated by the 
same GAMM model which is depicted in Figure 3. Please note: individual y-axis 
scales were applied due to big inter-individual differences of the compensatory 
magnitude. Vertical dashed lines denote the beginnings and the ends of the 
experimental phases. After the baseline phase (Base), the perturbation magnitude 
amounted to 220 Hz (Shift 1), 370 Hz (Shift 2), and 520 Hz (Shift 3). 
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the course of the experiment it is apparent that all three participants tried 
to compensate for the applied shifts with participant 7 eventually being 
able to achieve this goal for the upward but not the downward perturba- 
tion direction. 

Participant 6, for instance, initially increased F2 in both experimental 
syllables independently of the perturbation direction. After the second per- 
turbation phase, the produced F2 frequency started to drift again into the 
negative direction but remained, nonetheless, distinct for both syllables. 
This pattern provides evidence for the fact that, although she was not 
counteracting the applied perturbation, participant 6 was able to perceive 
the auditory errors caused by the downward and upward perturbations 
and to differentiate between them. Participant 5, on the other hand, differ- 
entiated between the two perturbation directions during the first perturba- 
tion phase, but changed her compensatory movements for the downward 
shifts during the second perturbation phase such that she produced the 
same F2 frequency for both syllables at the end of the experiment. 

Analogously to participants 5 and 6, participant 7 was not able to 
develop an appropriate compensation strategy when the perturbation 
was first applied. However, she changed her initial incorrect strategy in 
the course of the experiment. She started to counteract the perturbation 
during the second perturbation phase and eventually developed two con- 
sistently different production strategies by the end of the experiment. 

The relative compensation magnitude varied substantially across all 
participants independently of whether they could successfully compen- 
sate for both perturbation directions or not. During the last perturba- 
tion phase, for instance, the compensation magnitude fluctuated between 
6.6 and 103 % across all participants. Also, the amount of change of 
the compensation magnitude over the course of the experiment was not 
identical among the participants. Compare the adaptation patterns of 
participants 9 and 13 in Figure 4. While for participant 9 the compen- 
sation magnitude increased with the increasing perturbation magnitude, 
the compensation magnitude in participant 13 appears to have reached 
an absolute compensation limit for both perturbation directions around 
100 Hz. Overall, there was a weak negative correlation between the 
average compensation magnitude and the perturbation magnitude (r = 
-0.19, t=-10.31, p<.05, 95 % CI [-0.23 -0.16]). This suggests that the 
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average compensation magnitude slightly decreased as the perturbation 
magnitude increased. 


3.3. Role of the initial F1-F2 space 


To explain the occurrence of the symmetrical and asymmetrical compen- 
satory patterns, we investigated the influence of participants’ initial F1-F2 
space on their compensatory performance. 

The mean F1 and F2 frequencies produced by all participants during 
the baseline phase are summarized in Figure 5 split by participants’ gender. 

There were no statistically significant within-speaker differences in F1 
between the vowels of the four syllables. The average F1 difference between 
/di/ and /di/ in female participants with asymmetrical compensatory pat- 
tern (7 participants) was 19.27 Hz (t = 1.96, p > .05, 95 % CI [93.78 
32.63]), 4.17 Hz (t = 1.87, p > .05, 95 % CI [-0.13 7.3]) between /gi/ and 
/di/, and 3.73 Hz (t = 0.44, p>.05, 95 % CI [-11.72 19.18]) between /gu/ 
and /gi/. 

In female participants with symmetrical compensatory pattern (4 
participants), the average F1 frequencies were lower for every syllable. 
However, none of these differences was significant (/di/: -2.53 Hz, t= -0.16, 
p > .05, 95 % CI [-31.47 26.42]; /di/: -3.56 Hz, t = -0.17, p > .05, 95 % 
CI [-41.2 34.08]; /gi/: -2.89 Hz, t= -0.14, p > .05, 95 % CI [-41.31 35.54]; 
/gu/: -16.49 Hz, t = -1.17, p > .05, 95 % CI [-42.21 9.22]). In female 
participants who reacted inconsistently to the opposite perturbations (3 
participants), the average F1 frequencies were also lower for every syllable 
compared to the female participants with the asymmetrical compensation 
pattern. Again, none of these differences was significant (/di/: -9.8 Hz, t = 
-0.5, p > .05, 95 % CI [-45.61 26.0]; /di/: -23.22 Hz, t = -0.91, p > .05, 
95 % CI [-69.78 23.34]; /gi/: -23.43 Hz, t= -0.9, p > .05, 95 % CI [-70.97 
24.11]; /gu/: -30.53 Hz, t= -1.75, p > .05, 95 % CI [-62.34 1.28]). 

The F2 model indicated significant within-speaker differences between 
F2 values of the investigated vowels. In female participants with asym- 
metrical compensatory pattern (7 participants) the average F2 difference 
between /di/ and /di/ was -302.96 Hz (t =-6.3, p < .05, 95 % CI [-390.76 
-215.13]), -169.52 Hz (t =-5.26, p < .05, 95 % CI [-228.29 -110.76]) 
between /gi/ and /di/, and -1298.46 Hz (t =-19.08, p < .05, 95 % CI 
[-1422.68 -1174.34]) between /gu/ and /gi/. 
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In female participants with symmetrical compensatory pattern (4 
participants), the average F2 frequencies were lower for every syllable except 
for /di/. However, none of these differences was significant (/di/: 33.61 Hz, 
t = 0.47, p > .05, 95 % CI [-95.68 162.89]; /di/: -6.53 Hz, t= -0.07, p > .05, 
95 % CI [-188.9 175.8]; /gi/: -15.74 Hz, t= -0.16, p > .05, 95 % CI [-194.0 
162.54]; /gu/: -5.24 Hz, t = -0.18, p > .05, 95 % CI [-57.98 47.51]). In female 
participants who reacted inconsistently to the opposite perturbations (3 
participants), the average F2 frequencies were lower for every syllable except 
for /di/ and /gu/ compared to the female participants with the asymmetrical 
compensation pattern; only the difference for the syllable /gu/ was significant 
(83.23 Hz, t = 2.33, p < .05, 95 % CI [18.0 148.48]). The remaining three 
differences were not significant (/di/: -11.31 Hz, t = -0.13, p > .05, 95 % CI 
[-171.23 148.63]; /di/: 132.12 Hz, t= 1.07, p > .05, 95 % CI [-93.57 357.61]; 
/gi/: -107.42 Hz, t= -0.89, p > .05, 95 % CI [-327.84 113.42]). 

The F1 and F2 frequencies produced by male participants were on 
average lower for every syllable compared to the formants produced by 
female participants, however, these differences were only significant for F2 
values produced for the syllable /di/ (-270.41 Hz, t = -3.58, p < .05, 95 % 
CI [-408.26 -132.55]). 

Overall, the observed F1-F2 space of the vowels /i/, /i/, and /u/ was 
consistent with previous descriptive studies of Russian vowels (Lobanov, 
1971; Bolla, 1981). As expected, there was no statistically significant and 
no perceivable difference in F1 between the investigated vowels (previous 
research on formant perception indicates that on average participants do 
no perceive F1 differences below 50 Hz; Oglesbee & Kewley-Port, 2009). 
The vowels were differentiated most prominently by F2 with /i/ having the 
highest and /u/ the lowest values; F2 of /i/ lay between the other two vowel 
categories. F2 was higher in /di/ compared to /gi/ likely due to coarticula- 
tion. Furthermore, the initial F1-F2 vowel space did not significantly differ 
between the three participant groups which exhibited different compensa- 
tory patterns during the perturbation phases of the experiment. 


4. Discussion 


In the current investigation we presented results from a bidirectional audi- 
tory perturbation experiment conducted with native speakers of Russian. 
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Figure 5: The average F1-F2 vowel space produced by all participants during the 
baseline phase (no perturbation) for the four syllables /di/, /di/, /gi/, and /gu/. The 
data is split by participants’ gender. 


During three perturbation phases of the experiment, participants had to 
produce the close central unrounded vowel /i/ while its F2 frequency was 
perturbed in opposing directions depending on the preceding consonant 
(/d/ or /g/). The bidirectional perturbation was intended to increase the 
demands associated with the experimental task since participants had to 
coordinate their corrective movements in two different ways depending 
on the perturbation direction to produce the target vowel /i/. Based on 
the recurrent observation that participants counteract the applied auditory 
perturbation, we expected that the baseline F2 values for the two syllables 
/di/ and /gi/ would diverge over the course of the three perturbation phases 
since the magnitude of the perturbation increased in opposing directions 
from one perturbation phase to another. The two consonantal contexts 
(alveolar vs. velar) were chosen to evaluate the potential influence of phys- 
ical restrictions on the success of the adaptation outcome. 

The average adaptation behavior observed during the study confirmed 
our main hypothesis. The GAMM model estimated for the normalized F2 
frequency suggested that participants were able to adapt simultaneously to 
two opposing F2 perturbations and employ different strategies to produce 
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the vowel /i/ depending on the direction of the applied perturbation. These 
results are qualitatively in line with previous articulatory and auditory 
perturbation studies which show that most speakers are able to remap 
their initial articulatory-to-acoustics mapping under aggravated speech 
conditions (e.g., Gay et al., 1981; Savariaux et al., 1995; Feng et al., 2011). 

Furthermore, our results are consistent with findings by Rochet-Capellan 
& Ostry (2011) who demonstrated that participants are able to simul- 
taneously develop multiple strategies to produce the same target vowel. 
Adding to these results, our data shows that the results obtained by Rochet- 
Capellan & Ostry (2011) in the context of F1 are generalizable to F2. 

The compensatory effects observed in F2 frequencies for both perturba- 
tion directions were absent in our F1 data. This result serves as evidence 
for the validity of the applied experimental manipulations. 

The application of generalized additive mixed modelling (GAMM) 
allowed us to investigate the evolution of the adaptation process over 
time. Particularly, we were able to observe participant-specific differences 
in the spatial and temporal dimensions of the compensatory changes and 
to understand individual differences in how participants were generally 
able to cope with the demands of the experimental task. 

While one group of participants was almost immediately able to 
compensate for bidirectional formant perturbations during the first per- 
turbation phase, other participants needed longer periods of time to 
do so and started to compensate only in the second or the third per- 
turbation phase. Also, a few participants failed to identify the appro- 
priate compensatory adjustments altogether. Since these speakers also 
tended to change the initial direction of their compensatory movements 
throughout the experiment, their behavior can be best described as 
exploratory. In several instances, the directional changes of the compen- 
satory movements were quite abrupt as revealed by the non-linearities 
of the modelled F2 curves. 

Although the required corrective tongue movements were set along the 
same movement trajectory (forward vs. backward), participants had to 
figure out two quite different strategies to produce the vowel /i/. In partic- 
ular, they first had to identify the direction of the applied frequency shift 
for both experimental syllables and then to adjust their tongue movements 
appropriately. 
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It appears plausible that the correct identification of the F2 perturbation 
direction was not a trivial task since it took different amounts of time for 
different participants before they eventually started to consistently com- 
pensate for the applied perturbations if at all. Most convincing in this 
regard is the observation that some participants initially followed the per- 
turbation but started to counteract it after a while. These observations 
suggest that all participants without exception were indeed perceiving the 
auditory errors caused by the perturbations, however, not all were able to 
figure out the appropriate articulatory adjustments in order to minimize 
them. One potential reason for this might be their inability to identify the 
correct perturbation direction. 

This hypothesis can also explain the observation that in typical (uni- 
directional) auditory perturbation studies, beside a group of participants 
who counteract the perturbation, there are most likely a few participants 
who appear to follow the perturbation. Our data suggest that both 
reactions to auditory perturbations (counteracting and following) can be 
understood in more general terms as exploratory compensatory behavior 
where participants wander through the formant space in order to find 
the appropriate corrective movements to produce the intended acoustic 
output. In line with this idea is the observation from the current study 
that no participant actually followed the applied perturbations in both 
directions (upward and downward). 

The observed temporal non-linearities and abrupt directional changes 
of the compensatory responses challenge the idea that speakers exhibit 
either auditory or somatosensory feedback preference during speech pro- 
duction (Lametti et al., 2012). Strictly following the idea of feedback pref- 
erence, speakers with auditory feedback preference should have always 
reacted consistently to the applied auditory perturbations, i.e., indepen- 
dently of the perturbation direction. At the same time, we should expect 
that a subset of speakers with a preference for somatosensory feedback 
should virtually ignore the auditory perturbations. However, the examina- 
tion of individual adaptation patterns revealed that both assumptions do 
not appear to be true. 

First, among the 18 participants there were no speakers who ignored 
the applied perturbations, which might have suggested that they dis- 
prefer the auditory feedback channel during speech production. Secondly, 
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and more importantly, for several participants the direction and the 
magnitude of the compensation were not identical for both perturba- 
tion directions, which it should be under the assumption that a speaker 
exhibits a permanent preference for the auditory feedback channel. 
Furthermore, several participants were able to acquire the two appro- 
priate compensatory strategies after some practice. That is, the ability 
to develop a consistent compensatory strategy does not seem to depend 
on speakers’ preference for auditory or somatosensory feedback. The 
observation that the compensation magnitude was not identical for the 
two opposite perturbation directions across all participants deserves fur- 
ther attention. The estimated GAMM model suggested that although 17 
participants compensated for the upward perturbation, only five of them 
did it simultaneously for the downward perturbation. (Additionally, a 
single participant significantly shifted her F2 frequency upwards indepen- 
dently of the applied perturbation direction.) We dubbed this difference 
in compensatory profiles as asymmetrical and symmetrical compensation 
patterns. 

One potential explanation for the observation that far less participants 
were able to compensate for the downward perturbation is the articula- 
tory effort associated with the required forward compensatory movement 
of the tongue. Whereas the backward movement of the tongue from the 
central position of the /#/ is physically less restricted, the extent of the for- 
ward movement is limited by the alveolar ridge and the upper incisors. 
Since we do not have articulatory data of participants’ palatal shapes, this 
possibility cannot be ruled out completely. However, there is some evi- 
dence which undermines this hypothesis. 

Taking the introduced idea of physical restrictions further, we have to 
assume that the forward movement of the tongue in /di/ should be even 
more restricted compared to /gi/ as the tongue has already a more advanced 
position in the first syllable. This positional difference between /di/ and /gi/ 
is supported by the results on participants’ initial F1-F2 formant space 
presented in the section 3.3. Based on this fact, we should expect that the 
participants who were able to compensate for the downward perturbation 
were able to do so preferably for the syllable /gi/. However, from the six 
participants who significantly upshifted their F2 during the perturbation 
phases, three did it for /di/ and the other three did it for /gi/. That means 
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that there was no advantage for the syllable /gi/ as there ought to be as- 
suming that the distance from the produced vowel to the physical limit 
(i.e., alveolar ridge/upper incisors) played a crucial role for the success of 
the forward compensatory tongue movement. This interpretation is fur- 
ther supported by the GAMM modelling as the inclusion of the interaction 
between the perturbation direction and the syllable did not improve the 
overall fit. 

The idea of the influential role of physical constraints on the compen- 
satory movement can be restated in more abstract terms of somatosen- 
sory categories. In these notions, the upper limit of the compensatory 
movement is no longer assumed to be a physical boundary (i.e., alveolar 
ridge) but rather a somatosensory category boundary of the neighboring 
speech sound. In the particular case discussed in the above paragraph, 
both explanations can be used interchangeably without inducing different 
interpretations of the results. However, the reformulation of this hypothesis 
in somatosensory terms allows us to evaluate the potential influence of the 
somatosensory sound categories on the compensation magnitude in the 
case of the backward tongue movements since these were generally physi- 
cally less restricted in both cases of /di/ and /gi/. 

At first glance, it is conceivable that the higher distance from /i/ 
towards the somatosensory boundary of /u/ compared to the distance 
between /i/ and /i/ facilitated the compensation for the upward perturba- 
tion. However, as pointed out by Katseff et al. (2012), if we assume that 
speech sounds are also defined in sensorimotor space, compensatory mag- 
nitude should be restricted not only by a neighboring speech sound but 
foremost by the size of the sensorimotor region of the perturbed category. 
That means that when speakers deviate too much from the sensorimotor 
region of the perturbed category, the magnitude of the compensation 
should decrease. This hypothesis is, however, substantially challenged by 
the experimental data. 

Comparing the compensation magnitudes for the upward perturba- 
tion between /di/ and /gi/ reveals that participants compensated in both 
syllables in comparable amounts despite the fact that the somatosen- 
sory distance between /di/ and /gu/ was higher compared to /gi/ and /gu/. 
The absence of a difference in the compensation magnitude between 
/di/ and /gi/ was supported by the GAMM modelling as the inclusion of 
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the interaction between the perturbation direction and the syllable did 
not improve the overall fit. Furthermore, the examination of individual 
adaptation patterns revealed that a subset of speakers was able to com- 
pensate 100 % of the applied perturbation even when it reached 520 Hz 
and thereby induced a high degree of somatosensory error. Speaking to 
the same issue, for some participants who changed their F2 frequency in 
the same direction as the applied perturbation, the mismatch between the 
auditory and somatosensory error grew ever higher over the course of the 
experiment. 

Taking all this evidence into account, the emergence of the asymmetric 
compensatory pattern is difficult to explain in terms of the violation of 
somatosensory boundaries. Consistent with this idea is the fact that there 
were no systematic differences in participants’ initial F1-F2 formant spaces 
which could predict their different compensatory profiles (symmetrical, 
asymmetrical, and non-consistent). 

Without resorting to the somatosensory boundaries, we can think of one 
alternative explanation for the emergence of the asymmetric compensatory 
pattern. Central to this hypothesis is the idea that the asymmetric pattern 
emerged due to an asymmetry in the phonemic space of the Russian high 
vowels. In particular, while /i/ appears only after palatalized consonants in 
Russian, both /#/ and /u/ follow only non-palatalized ones (cf. Bolla, 1981). 
The palatalization contrast is an important part of the Russian phonology 
and is very present for Russian speakers. The most common acoustic fea- 
ture associated with palatalized consonants is the high F2 frequency at the 
beginning of the following vowel. This acoustic feature is so important for 
the perception of palatalization by Russian speakers that even cross-spliced 
syllables containing non-palatalized consonants and vowels with high ini- 
tial F2 frequency are perceived as palatalized (cf. Bondarko, 2005). We 
think that this perceptual effect might have occurred during our experiment. 

Since the baseline F2 values of /i/ and /i/ are on average substantially 
closer to each other compared to /#/ and /u/, it seems reasonable that most 
participants classified instances of /i/ shifted towards /i/ as phonemic errors 
of palatalization and corrected for them by lowering their F2. On the other 
hand, only a few participants reacted to the F2 perturbation of /i/ towards 
/u/ as it did not induce a change of palatalization status of the perceived 
syllable. Presumably, those participants who reacted by the same amount 
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to the downward perturbation as to the upward perturbation were more 
sensitive to general F2 changes independent of the phonemic status of the 
perceived syllable. Unfortunately, we do not have participants’ perceptual 
profiles which could settle this question completely. 


5. Conclusion 


Despite a growing body of research, the factors which induce the inter- 
individual outcome variability during sensorimotor learning are still much 
debated. Our investigation has shown that there is merit in varying task 
parameters within the same experimental session and in analyzing the data 
of perturbation experiments taking the temporal dimension of the adap- 
tation process into account. By doing this, we could show that the inter- 
and intra-participant variability present during somatosensory learning in 
speech is beyond the predictions of the hypotheses which ascribe this var- 
iability exclusively to the characteristics of speakers’ internal models of 
speech motor control. 
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Aetiology of speech sound errors in autism 


Abstract: In looking at speech perception and production, it is vital we understand 
variation in different populations in order to understand variation in what is per- 
ceived as typical speech development; develop bio-markers; and provide effective 
methods for diagnosis and intervention where required. Research suggests that 
people with autism experience higher rates of speech sound errors (SSEs) than 
their peers (Cleland, Gibbon, Peppé, O’Hare, & Rutherford, 2010; Shriberg, Paul, 
Black, & Santen, 2011), yet the reasons why are unknown. This chapter takes an 
in-depth look at the current literature on SSEs produced by people with autism, 
from young children to young adults. It explores why these higher rates occur, 
moving beyond the previous debate of whether they exist at all in this population. 
Recent studies using detailed analyses show that children with autism exhibited 
significantly higher rates of SSEs than typically developing (TD) children, these are 
discussed in detail alongside a critique of the methods historically used to assess 
SSEs in this population. This chapter proposes two perspectives that may account 
for these higher rates of SSEs in autism: a) the speech attunement framework and 
b) deficits in speech motor control. It explores how both of these perspectives may 
intersect to produce SSEs in people with autism. Both are discussed in relation to the 
comorbidities of speech perception issues and motor deficits often found in people 
with autism. Suggestions are made for future research using sensitive articulatory 
analysis of speech such as ultrasound tongue imaging or electropalatography. This 
chapter highlights the need to look equally at both linguistic and motor skills in 
children with autism to describe accurately the range of cognitive and neurophysi- 
ological processes that may affect speech production. 


Keywords: autism, speech errors, ultrasound, electropalatography, speech attune- 
ment framework, speech motor impairment 


1. Introduction 


People with autism present with higher rates of speech sound errors (SSEs) 
than their peers (Cleland, Gibbon, Peppé, O’Hare, & Rutherford, 2010; 
Shriberg, Paul, Black, & Santen, 2011) yet the reasons why are unknown. 
While SSEs and related disturbances to speech prosody might be a salient 
feature of autism on the first encounter, most research has focussed on the 
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(arguably) more serious nature of social problems in autism. The small 
amount of research that does exist on SSEs is heterogenic, similar to the pre- 
sentation of the condition itself. Early studies suggest people with autism 
have normal speech development or that it is simply delayed (Kjelgaard 
& Tager-Flusberg, 2001; McCleery, Tully, Slevc, & Schreibman, 2006). 
However, recent work using sensitive statistical measurements showed 
that children with autism exhibited significantly higher rates of SSEs than 
typically developing (TD) children. Moreover, researchers have identified 
disordered speech development when phonetic and phonological analyses 
go beyond percentage consonants correct measures (Cleland et al., 2010; 
Shriberg et al., 2011; Wolk & Brennan, 2013). 

The cause of SSEs in autism has not been fully explored in the literature. 
Up to this stage, researchers have focused on testing whether or not SSEs 
are a feature of autism, without exploring why these may occur. Now that 
recent evidence demonstrates SSEs are prevalent in autism, this chapter 
will examine their possible causes. There are currently two perspectives 
on why SSEs occur in children with autism: (1) the speech attunement 
framework first described by Shriberg et al. (2011) and (2) the speech 
motor impairment theory set out by Belmonte et al. (2013). These two 
perspectives will be discussed in the context of autism and results from the 
literature will be explored regarding how each framework might intersect. 


2. Speech sound errors 


Before exploring why SSEs are present in autism, it is important to under- 
stand what is expected of normal speech production, what speech sounds 
errors (SSEs) are and when they occur. Speech production and perception 
can breakdown at multiple levels, reducing the effectiveness of the final 
goal of fluent speech (Ferrand, 2014). The neural processing required for 
speech production and perception is still only partially understood (Baghai- 
Ravary & Beet, 2013). Speech production has been characterized as one 
of the most complex motor skills, functioning as multiple subsystems 
that must effectively coordinate together (Duffy, 2000). For example, the 
phonatory system, which consists of the laryngeal muscles, vocal folds 
etc., must work in a coordinated manner to achieve effective voice pro- 
duction. Likewise, the phonatory system must also be coordinated with 
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other sub-systems (e.g. respiratory system). Speech perception relies on 
the auditory system in which acoustic signals are transformed into mean- 
ingful representation of spoken language (Gandour & Krishnan, 2016). It 
requires various complex perceptual and cognitive tasks along the audi- 
tory pathway. Motor speech representations are important for both per- 
ception and production (Ravizza, 2005). 

We use the term “Speech Sound Error (SSE)” here to describe dif- 
ficulties with the production of speech sounds or speech segments 
(American Speech-Language-Hearing Association, 2017). Common clin- 
ical distortions or residual SSEs, such as rhotic or sibilant distortions are 
relatively minor and generally not associated with language or intelli- 
gibility deficits (Shriberg et al., 2011). In contrast, other children may 
have speech which is unintelligible even to close family members. SSEs 
are common in early childhood and include articulation errors (motor- 
based production deficits) and phonological errors (knowledge and use 
of speech sounds) (Eadie et al., 2015). Articulation errors often come in 
the form of distortions whereas phonological errors come in the form 
of substitutions and deletions such as consonant cluster reduction, final 
consonant deletion, velar fronting and stopping of fricatives. Articulation 
and phonological errors are not mutually exclusive and both of them can 
occur in a child’s speech profile. 

Problems start to arise when speech errors are not resolved during 
childhood and can then be described as either residual speech sound 
errors (RSSEs) or persistent speech errors. Residual speech errors arise 
as “leftovers” from an earlier speech delay (omission or substitution 
errors) that migrated closer to the norm to become distortions, whereas 
persistent speech errors are distortions that have been habituated from 
an early age. Residual speech sound errors often affect late acquired 
and motorically complex speech sounds such as /s/ and /r/ and are 
manifested as common clinical distortions of these sounds, for example 
lateralised /s/ or labiodentalised /r/. These types of errors might be par- 
ticularly common in people with autism (33 % of verbal adolescents 
and adults with autism compared to just 1-2 % of the typical popula- 
tion, Shirberg et al., 2001). Why this is the case is not known, though 
Shriberg and colleagues ascribe it to a difficulty in fine-tuning to the 
ambient speech model. 
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3. Aetiology of speech sound errors and autism 


Autism is a neurodevelopmental disorder in which there is a frequent 
co-occurrence of verbal and non-verbal deficits (American Psychiatric 
Association, 2013). People with autism are known to have persistent 
deficits in social behaviour, communication, and language, which may be 
entwined with their difficulties in producing intelligible speech. Evidence on 
speech impairment in autism is heterogenic. Some researchers have found 
it is disrupted in children with autism while others have found speech to be 
either delayed or developmentally appropriate when using perceptual and 
behavioural checklist assessments (Bartolucci & Pierce, 1977; Kjelgaard 
& Tager-Flusberg, 2001; Wilkinson, 1998). The literature currently lacks 
organisation of theoretical concepts, with different studies relying on dif- 
ferent methods to measure SSEs. 

Interactions between different areas of impairment in autism may cause 
SSEs. A triad of symptoms associated with autism could impair speech 
development: social motivation, cognitive (and motor) control, and per- 
ceptual control. Social motivation is a set of psychological and biolog- 
ical mechanisms that biases a person to orient to the social world, seek 
social interactions and maintain social binds. In autism there appears to 
be a decrease in attention given to social information, causing a cascading 
effect on the development of social cognitive skills (Chevallier, Kohls, 
Troiani, Brodkin, & Schultz, 2012). Impaired social motivation means 
the child may miss vital communicative opportunities in which to develop 
typical speech. This encompasses cognitive rigidity, a trait associated with 
speech disorders: one study found children with consistent speech disorder 
performed worse in cognitive flexibility tasks (Crosbie, Holm, & Dodd, 
2009). A piecemeal cognition style alongside these social deficits may 
also result in autistic traits (Valla, Maendel, Ganzel, Barsky, & Belmonte, 
2013). Social motivation impairment may negatively impact the develop- 
ment of neural networks critical to social cognition, e.g. face processing 
(Sterling et al., 2008). This cognitive style is described as “piecemeal” as 
the person’s attention is either on the individual components of the face or 
on the physical configuration, losing the important social information in 
this interaction. This cognitive style may also have a significant effect on 
processing of speech, where only certain aspects of speech are attuned to, 


Aetiology of speech sound errors in autism 113 


e.g. phonological elements that are necessary for differentiating meaning 
may be given preference over phonetic aspects which signal speaker iden- 
tity. Moreover, suprasegmental aspects of speech, like prosody or pitch, go 
unnoticed, resulting in the production of unusual-sounding speech. 

Perceptual processing may also be a cornerstone in our understanding of 
autism (Baum, Stevenson, & Wallace, 2015). Sensory representations form 
the basis of higher-order cognitive representation. However, anomalies in 
sensory processing in autism are not well understood. Perceptual anomalies 
in autism may account for commonly found traits in language and speech 
impairment. People with autism have been found to have difficulty in 
social orientation to relevant auditory stimuli such as speech (Kuhl et al., 
2005; Paul, Chawarska, Fowler, Cicchetti, & Volkmar, 2007). However, 
significantly more research is required to determine how this affects social 
and cognitive development. 

A link has also been found between deficits in language, literacy and 
SSEs (Carson et al., 2003; Goffman, 1999; Hayiou-Thomas, Carroll, 
Leavett, Hulme, & Snowling, 2017; Whitehurst, Smith, Fischel, Arnold, & 
Lonigan, 1991; Williams & Elbert, 2003). Literacy and SSEs have a com- 
plex relationship, presence of speech sound disorders have been found to 
have a small but significant risk of poor phonemic skills, spelling and word 
reading. While SSEs alone only have a modest effect on literacy develop- 
ment, when it is part of additional risk factors such as language delay, 
these can have serious negative consequences. This is consistent with the 
findings that multiple risks such as SSEs and language delay/disorder can 
accumulate to predict reading disorder (Hayiou-Thomas et al., 2017). 

Previously the primary focus of research in communication in chil- 
dren with autism has been language, prosody and behavioural difficul- 
ties (Kjelgaard & Tager-Flusberg, 2001; Owens, 2004; Paul & Norbury, 
2012). The presence of speech sound difficulties is now being acknowl- 
edged (Cleland et al., 2010; Shriberg et al., 2011; Wolk, Edwards, & 
Brennan, 2016) but questions remain on the nature of the speech sound 
errors in autism. It is suggested that children with autism may exhibit 
speech production that is characteristically different in its organisation 
from typical speech. This may be due to developmental delay, but may 
also be due to differences in the underlying psychological or neuromotor 
structures required to produce speech and to practice it regularly in 
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everyday experience with social others. Below we discuss the presentation 
of SSEs in the current literature available. 


4. Historical research of SSEs in children with autism 


Around 40 years ago, behavioural studies concluded that children 
with autism had a delayed pattern of acquisition of speech sounds sim- 
ilar to children with intellectual disability (Bartolucci & Pierce, 1977). 
“Oddities” in speech production were often described following broad 
phonetic transcriptions (Pronovost, Wakstein, & Wakstein, 1966). Due 
to the lack of in-depth instrumental analyses or narrow phonetic tran- 
scription, these errors revealed little about specific speech patterns and 
conspicuously absent was the level of detail required to identify the minor 
articulatory distortions. Speech sound production was often only assessed 
in addition to other aspects of communication impairment (e.g. recep- 
tive and expressive language, social communication deficit, prosody). The 
main purpose of the analyses in these studies were often other aspects of 
communication impairment. The analysis of speech was therefore often 
only assessed in brief using parent questionnaires or short perceptual tests 
of single words (Pronovost et al., 1966). This severely limited the detail of 
findings beyond broad diagnostic categories. 

During the last century and more recently, smaller, more in-depth case 
studies used phonological analysis to identify both delayed and disordered 
phonological processes (Wetherby, Yonclas, & Bryan, 1989; Wolk & 
Brennan, 2013; Wolk & Edwards, 1993; Wolk & Giesen, 2000). Wolk and 
Giesen (2000) carried out a phonological analysis of speech elicited with 
both object naming and spontaneous speech in four children with autism. 
They found typical but delayed phonological processes. However, they 
also identified atypical processes such as residual errors, unusual sound 
changes and chronological mismatch (where phonemes are not acquired 
in the developmentally typical order). The speech profiles contained evi- 
dence of both articulation errors and phonological errors. All children had 
a diagnosis of a phonological disorder ranging from mild to severe, with 
one child classed as non-verbal. Even within this very small sample, there 
is a huge variation, indicating that there are likely different subtypes of 
speech sound disorders in autism. One of the few studies that focused 
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exclusively on speech sound behaviour was carried out by Bartolucci and 
Pierce (1977). Using a picture-naming task (single word analysis) to as- 
sess both perception and production, they compared children with autism 
to TD children and children with intellectual disability. The analysis was 
limited. Twenty-four consonant sounds were broadly transcribed using the 
International Phonetic Alphabet and errors were compared in terms of 
percentage consonants correct. They concluded that verbal children with 
autism failed to show any atypical traits in the production or perception 
of speech sounds. Additionally, when looking at the findings in further 
detail using more sensitive phonetic analysis they found a delayed pattern 
of acquisition, similar to children with intellectual disability. Conversely, 
there was a significant difference in the percentage of errors on liquids 
made by children with autism (11.4 %) compared to children with intel- 
lectual disability (4.7 %) and TD children (0 %). Similarly, in their per- 
ception task of liquids, there was a significant difference between errors 
made by children with autism (16.6 %) compared to TD (8.6 %) and 
intellectual disability (5.5 %). These results indicate an atypical profile 
rather than delayed speech acquisition. Further analysis of the consistency 
and frequency of errors may have revealed more about the speech sound 
patterns of these different groups. These studies tell us that SSEs were 
being identified in groups of children with autism, but we require measures 
that are more sensitive in order to determine the causes. Further in-depth 
analyses are required to determine the pattern of errors produced by this 
particular group at all stages of communication development. 


5. Atypical speech in young children with autism 


Findings from multiple researchers suggest that children with autism at the 
prelinguistic stage of communication have different phonatory qualities 
than their peers and children with other developmental disorders (Schoen, 
Paul, & Chawarska, 2011). This information could contribute to early 
identification that would allow intervention to be put in place during the 
optimal period, i.e. in the first years of life, when the brain is developing 
rapidly and there is significant neural plasticity. 

Atypical speech development appears to be identifiable at an early stage 
of communication in children with autism. Smaller case studies using 
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perceptual phonological analysis identified SSEs in young children with 
autism (under 5 years). Wetherby et al. (1989) analysed the syllables of 
vocalizations made by three children with autism (under 5 years) in a 
30-minute sample of communicative behaviour. They found that the chil- 
dren had a deficient proportion of vocal acts that contained a consonant. 
This absence of some consonants in communicative acts might be an early 
warning sign that speech is not developing normally. It is difficult to dis- 
tinguish whether this is a result of phonological or articulatory issues, thus 
further analysis of the speech errors made would be required. Samples 
of communicative interactions are often used in younger cohorts to ana- 
lyse their phonological development perceptually. One interesting finding 
from this in-depth analysis is the presence of “atypical vocalizations” in 
children with autism. Atypical vocalizations were the primary aspect of 
prelinguistic communication that differentiated children at high risk for 
autism (9-12 months) from children at low risk (Schoen et al., 2011). To 
investigate this further Schoen et al. (2011) studied phonological and vocal 
behaviour using broad phonemic transcription of speech-like utterances 
and coded non-speech vocalizations without recognisable consonants. 
They found 30 toddlers (18-36 months) with autism exhibited “atypical 
vocalizations” and overall a limited number of consonants compared to 
two groups of TD children (age-matched and language- matched). Whilst 
the percentage of consonants correct was not different from their peers, 
the number of speech-like utterances produced was significantly less. The 
main area of difference between the children with autism and their peers 
was the presence of “atypical vocalizations”. These atypical vocalizations 
came mainly in the form of high-pitched squeals (Schoen et al., 2011). 
What this research might suggest is that toddlers with autism do not align 
their speech to the duration, pitch and phonotactic properties of their 
ambient language environment. 

Toddlers with autism may not tune into the language model of their 
environment (Sheinkopf, Mundy, Kimbrough Oller, & Steffens, 2000). 
Their failure to attend to their ambient language environment may nega- 
tively affect their ability to acquire spoken language, which in severe cases 
can mean people with autism remain nonverbal throughout life. In a study 
of early vocal behaviours in young children with autism (n=15) and chil- 
dren with developmental delays (n=11), the children with autism did not 
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differ in production of well-formed complex canonical vocalizations, but 
had significantly more utterances with atypical vocal quality (Sheinkopf 
et al., 2000). Canonical vocalizations are well-formed consonant-vowel 
sequences with rapid CV transitions. An impairment in these sequences 
serves as a sign of speech motor control impairment, which was not the case 
in this sample. However, the significant presence of atypical vocal quality 
may be an indicator of speech perception issues. Wallace et al. (2008) 
reanalysed this data using acoustic analysis and more refined categoriza- 
tion techniques and found that children with autism produced more atyp- 
ical phonatory qualities than children with developmental delay. On the 
contrary, Schoen et al. (2011) found that toddlers with autism followed a 
normal trajectory of phonological development, suggesting no issues with 
speech development. However, there was a significant presence of atypical 
vocalizations in their speech, which may be due to a presence of speech 
attunement issues. Descriptions of vocal profiles differentiating develop- 
mental profiles could provide valuable evidence for early biomarkers of 
autism and could help us explain the origin of the issues in speech pro- 
duction demonstrated at a later age. Further research as to why atypical 
vocalizations occurs in young children with autism is required. 

Current research of SSEs in autism has started to use technology as 
a means of increasing the sensitivity of analysis of speech, both qualita- 
tively and quantitatively. Shriberg et al. (2001) found a predominance of 
articulation errors in children with autism using the “PEPPER” software. 
This software allowed analysis of the type and frequency of consonant 
and vowel errors in conversational speech. Using this method, they found 
33 % of the cohort with autism had at least one type of speech distortion 
error (residual speech sound errors, such as lateral lisps). These may be an 
indicator of a disordered speech profile, rather than delayed speech acqui- 
sition as previously assumed in earlier research. 

An interesting finding from this study was the significant presence of 
“residual speech sound errors”. These occur when speakers older than 
nine years have two or more of the same type of residual distortion errors 
(e.g. dentalized sibilants, derhotacization). Thirty-three percent of the chil- 
dren with autism presented with residual speech errors, a significant pro- 
portion compared to the expected 1-2 % found in the TD population 
(Flipsen, 2015). Residual speech sound errors are clinically significant as 
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they involve sub-phonemic changes in articulatory place and manner and 
can persist over the individual’s lifespan. 

Further evidence of SSEs in autism was found by Cleland et al., (2010) 
who report atypical/non-developmental SSEs in children with autism. 
They carried out a phonetic and phonological analysis of speech sound 
production in 69 children with autism. Using standardized clinical percep- 
tual assessments, only 12 % of the sample received a diagnosis of speech 
delay/disorder. However, when using further in-depth phonological and 
phonetic analysis, they found 41 % of the group produced speech errors 
indicative of both speech delay and speech disorder. 

The clinical assessment of speech used was a perceptual assessment 
called the Goldman Fristoe Test of Articulation (GFTA-2; Goldman & 
Fristoe, 2000). This is one of the few standardized assessments of speech 
sound behaviours in children. It examines speech sounds in the context 
of single words. Further research beyond this assessment such as single 
words of increasing complexity (polysyllables), maximum performance 
tasks or spontaneous speech may reveal motor constraints which have a 
substantial negative impact on intelligibility or increase the likelihood of 
an SSE occurring. Cleland et al. (2010) found non-developmental speech 
errors occurred despite whether a child’s standard score fell within normal 
range or not on the GFTA (Goldman & Fristoe, 2000). This implies SSEs 
produced by people with autism may not meet the criteria for a speech 
disorder in clinical assessments. However, there is more to understand in 
relation to speech profiles of people with autism, which may reveal infor- 
mation about the different subtypes within autism and whether this aligns 
with a particular speech profile. 

The study by Cleland et al. (2010) is in agreement with previous findings 
by Kjelgaard & Tager-Flusberg (2001) and Rapin, Dunn, Allen, Stevens, 
& Fein (2009) that children with autism make a number of SSEs. Cleland 
et al. (2010) found in their sample that while speech was characterised 
by developmental phonological errors (gliding, cluster reduction and final 
consonant deletion), non-developmental errors, indicative of a speech 
disorder, were also present (e.g. phoneme specific nasal emission and initial 
consonant deletion). To understand why there may be SSEs in the group, 
Cleland et al. (2010) carried out a battery of standardized assessments in 
speech, language and non-verbal cognition to determine if there are any 
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causal links. Interestingly no relationship between speech and language or 
speech and cognition was identified in this group. This indicates SSEs may 
be a result of another impeding factor. Cleland et al. (2010) hypothesised 
that the increase of SSEs may be due to an underlying neuromotor diffi- 
culty. Additionally, it could also be due to speech attunement difficulties. 
Further analysis of auditory perceptual abilities and speech motor abilities 
is needed to understand the origin of the SSEs in this group. 

Wolk and Giesen (2000) carried out a phonetic inventory and process 
analysis and found in four siblings with autism speech processes indica- 
tive of delayed speech development. In addition, they identified atypical 
processes such as residual articulation errors, unusual sound changes and 
chronological mismatch in their speech profiles. All four children were 
significantly delayed in gross motor and fine motor abilities. The combina- 
tion of residual articulation errors (indicative of motor issues) and unusual 
sound changes (indicative of perceptual issues) suggests these children ap- 
pear to have a combination of both speech motor control and speech per- 
ception issues. However, they did not find differences in suprasegmental 
production; children with autism did not produce vocalizations different 
in fundamental frequency or duration from TD peers, suggesting they are 
able to tune in to their ambient environment effectively in some ways. 
These children may be a different subtype of autism. These are limited 
measures of suprasegmental ability and would require further analysis. 


6. Methodological issues of measurement of SSEs 


Multiple methodological issues need to be taken into account when 
assessing SSEs. Firstly, analysis of speech in single-word contexts may 
be ineffective. It does not examine the effect of complex articulatory 
gestures during spontaneous speech, which is significantly more motori- 
cally complex than single word production (Adams, 1998). Kjelgaard and 
Tager-Flusberg (2001) investigated language and speech production in 
eighty-nine children (4;0-14;0 years) with autism. They argued whilst there 
was significant heterogeneity in the children’s language skills, their artic- 
ulation skills were relatively spared. However, this conclusion is brought 
into question when noting they also used the Goldman Fristoe Test of 
Articulation. It required further phonetic and phonological analysis for 


120 Louise McKeever et al. 


Cleland et al. (2010) to identify speech distortions using this assessment 
alone which was not carried out in this study. Therefore, Kjelgaard and 
Tager-Flusberg’s (2001) assessment of speech sounds may have been inad- 
equate to determine whether their sample of children had SSEs. Clinically 
the children may not have met the diagnosis for speech sound disorder, 
but there is value in understanding if SSEs occur in order to gain under- 
standing on speech perception and production in autism. 

Perceptual single-word assessments helped identify irregularities in 
speech sound production in some studies. Rapin et al. (2009) used a single- 
word assessment, the Photo Articulation Test (Lippke, Dickey, Selmar, & 
Soder, 1997) to analyse the speech of 62 children with autism. The test 
yields a score for correct speech sounds produced in naming single word 
objects. Similar to Cleland et al. (2010) and Shriberg et al. (2011) they 
found that 28 % of the participants’ speech was characterized by persis- 
tently and severely impaired speech sound production; this was despite 
better language comprehension. Additionally, they analysed the sponta- 
neous speech samples and concluded that “several minutes of conversation 
provides more opportunities for mispronunciations than the single words 
of the Photo Articulation Test (Rapin et al., 2009). Their assessment was 
in agreement with the Photo Articulation Test results, finding 28 % of the 
speech sample was characterized by severely impaired expressive phono- 
logic skills. However, this was not an in-depth analysis, the authors rated 
each child’s speech on a 3-point scale (0= normal to 2= severe impairment). 
Whilst both these results indicate an abnormality in speech production of 
some children with autism, again it does not go beyond a quick perceptual 
analysis. One reason that (Wolk & Giesen, 2000) may have identified SSEs 
whereas McCleery et al. (2006) and Kjelgaard and Tager-Flusberg (2001) 
did not, is that they elicited speech using two methods: object naming and 
spontaneous speech utterances. Additionally, they did not rely on percep- 
tual standardized assessments. It is vital that researchers consider speech in 
multiple contexts to ensure subtle articulation errors are identified. 

The use of ineffective standardised assessments, and issues of the 
nature of autism can cause difficulty in speech sound assessment (Macrae, 
2017). For instance, in children with autism with severe language 
impairment, there is often difficulty obtaining a speech sample due to 
expressive language difficulties associated with autism. McCleery et al. 
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(2006) investigated the consonant production of 14 severely language 
delayed children with autism and 10 TD children. To assess speech in 
the context of severe language delay, their assessment involved a com- 
municative inventory providing opportunities for the child to produce 
voiced and voiceless consonant sounds. All vocalisations, including 
babbling, were scored in an effort to determine the child’s consonant 
production repertoire. McCleery et al. (2006) concluded that the chil- 
dren with autism showed the same general speech sound production pat- 
tern as TD and language-learning impaired children. Interestingly, they 
acknowledged that the children with autism produced more sounds that 
were not classified as developmentally normal but did not carry out fur- 
ther analysis on these errors. Transcription and counting of these errors 
may have revealed an alternative speech pattern in children with autism 
or clinically significant errors. Furthermore, analysis of the “abnor- 
malities” may have provided indicators of whether the nature of these 
errors were motoric or phonological, similar to studies of early commu- 
nication behaviour (Paul, Fuerst, Ramsay, Chawarska, & Klin, 2011; 
Schoen et al., 2011). 

Previous studies investigating speech of children with autism may have 
misidentified subtle articulation errors because of the imprecise nature of 
perceptual speech assessment. The assessments described in studies using 
single word analysis are reliant on these perceptual measures of speech. 
Precise information about articulatory movements cannot be identified 
from perceptual analysis alone. Yet this information may reveal more 
about speech motor control in autism and whether alternative move- 
ment strategies exist due to motor impairment and/or attunement issues. 
Auditory perceptual judgements are susceptible to errors and bias of the 
listener (Kent, 1996). An example of this is listener normalization where 
the listener mistakenly recognises phonemes that were not produced by the 
speaker. Even if errors are identified, suitable transcription techniques are 
lacking in the ability to distinguish these errors (Kent, 1996). Broad pho- 
netic transcription is reliant on the categories of the IPA chart, even though 
variation within each category can vary significantly across individuals 
(Mowrey & MacKay, 1990). Speech assessment needs to look at a speech 
in multiple contexts, where articulatory gestures are more complex and 
using more in-depth forms of phonetic and phonological analysis. This 
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will help to determine whether SSEs in autism are due to speech perception 
difficulties and/or tuning into speech or to speech motor control issues. 


7. Potential causes of SSEs in autism 


Several research groups have reported that around a third of children with 
autism present with oral motor or speech sound abnormalities at various 
levels of severity (Belmonte et al., 2013; Cleland et al., 2010; Shriberg et al., 
2011). We will now discuss two potentially complementary perspectives of 
why this may be the case. 

The “speech attunement framework” was originally developed to 
explain common speech errors in the otherwise typically developing 
populations (e.g. dentalized sibilants). Shriberg et al. (2011) developed 
the “speech attunement framework” due to ongoing suggestions that 
impairments in gross motor, fine motor and oral motor control in people 
with autism were associated with the speech deficits frequently exhibited. 
The speech attunement framework posits that a child learning speech 
needs to attend to their ambient environment or ‘tune in’ to models in 
that environment. For example, young children adopt dialect features 
of their peers by tuning in. In addition, they need to make small and 
careful adjustments to their speech production to ‘tune up’ for accurate 
and socially acceptable speech production (Shriberg et al., 2011). It is 
also important at this stage to have a maturing speech motor system that 
ensures adjustments made to speech can be done so with adequate control 
by the child. A difficulty with speech motor control could intersect with 
speech attunement and cause the heterogenic speech profiles identified in 
speakers with autism. 

As discussed earlier, people with autism may have a reduced ability 
and/or motivation to focus on the subtle details of articulation, due to 
social motivation impairment. This prevents them from making minute 
adjustments in order to produce speech similar to their social partners and 
others in their ambient environment (Shriberg et al., 2011). In essence, 
children with autism are thought not to have the psychological conditions 
necessary to engage socially with others through language to give the nec- 
essary experiences for learning speech. Speech attunement may be affected 
in people with autism by various combinations of the following conditions: 
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a) Enhanced auditory capacity, often observed in people with autism 
(Baum et al., 2015) may lead to earlier “tuning in” when motor matu- 
rity has not been achieved. Therefore, SSEs develop due to motor 
constraints. 

b) Constraints in affective social reciprocity, a common trait of people 
with autism (Chevallier et al., 2012) may delay “tuning in” and any 
motor speech disorder present may impair the ability to tune up. 


Shriberg et al. (2011) investigated whether children with autism had 
‘speech attunement’ issues and a comorbid speech motor disorder, spe- 
cifically childhood apraxia of speech (CAS). CAS impairs the precision 
and consistency of speech movements, despite the lack of any neuromus- 
cular deficits. To determine whether the increased presence of SSEs in 
children with autism was a result of speech attunement issues or CAS, 
Shriberg et al. (2011) examined the continuous speech of 40 children with 
autism; 40 TD children; 13 children with speech delay; and 15 individuals 
with CAS. They used software PEPPER (Programs to Examine Phonetic 
and Phonological Evaluation Records; Shriberg et al., 2001). They used 
this software to perceptually and acoustically analyse continuous speech 
samples, transcribing and prosody-coding subsets of the speech. 

This detailed analysis was designed to identify specific signs of CAS 
or motor speech disorder, e.g. slow speaking and articulation rate, spa- 
tiotemporal vowel errors and distorted consonants (American Speech- 
Language-Hearing Association (ASHA), 2007; Aziz, Shohdi, Osman, & 
Habib, 2010). Although, it can be argued that the findings did not support 
a diagnosis of motor speech disorder or CAS, children with autism had 
voice differences not reported in the CAS group, e.g. inappropriate loud- 
ness, abnormally high pitch. Additionally, they had appropriate rate and 
stress, in direct contrast to symptoms of CAS. Shriberg et al. (2001) use 
these results as evidence of speech attunement issues, rather than a motor- 
speech impairment. However, 75 % of children with autism had increased 
repetitions and revisions; a symptom demonstrated by CAS speakers, with 
both groups producing these significantly more than TD children. As a 
result, some indicators of speech attunement were noted: 


a) Increased repetitions and revisions, consistent with the description of 
autistic speech as “disfluent”. 
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b) Misplaced stress, often described as “off” or “singsong” (Peppé, 
McCann, Gibbon, O’Hare, & Rutherford, 2007). This stress is dissim- 
ilar to the well-documented “excessive-equal” stress pattern in apraxia 
of speech. 

c) Inappropriate loudness and pitch. 

d) Higher rates of speech delay and speech errors relative to population 
estimates. 


A study by Baron-Cohen and Staunton (1994) found that children with 
autism whose mothers were non-native English speakers were more likely 
to develop their mother’s non-native accent (83.3 % of the sample) than 
that of their peers. TD children without social communication difficul- 
ties have a strong drive to identify with peers they engage with regularly. 
A lack of drive to identify with peers, present in autism, could lead to 
weaknesses in opportunities for speech attunement. We would then expect 
children with autism to show higher rates of phonetic distortions. Indeed, 
research by Shriberg et al. (2001) has shown that adults with autism 
frequently produce articulation distortions, such as sibilant and rhotic 
distortions. It should be noted that this study included both adolescents 
and adults, speech distortions did not resolve with age and were classified 
as “residual speech errors”. Cleland et al. (2010) also found a number of 
older children with similar phonetic distortions. These studies tell us that 
speech distortions, which may be a result of poor speech attunement in 
childhood, can continue through to adolescence and adulthood. This is 
an important finding as children with residual speech sound errors face 
an increased risk of social, emotional and/or academic challenges relative 
to their peers with typical speech (Hitchcock, Harel, & Byun, 2015). This 
likely compounds the social and emotional disadvantages children with 
autism already have. 

One aspect of speech attunement that requires examination is the effect 
on suprasegmental attributes such as pitch. Tonal languages such as Chinese 
and Thai rely on the ability to perceive pitch as they involve categorical 
distinctions of lexical tone. Lexical tones serve a phonemic role, they are 
vital for speech comprehension and production (Wang, Wang, Fan, Huang, 
& Zhang, 2017). Wang at el. (2017) found in an event-related potential 
(ERP) study that 16 children with autism had lexical tone processing that 
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was impaired and likely had its root cause as a phonological deficit in cat- 
egorical perception, similar to the findings of Yu et al. (2015). Bonneh, 
Levanon, Dean-Pardo, lossos & Adini (2011) also found abnormal speech 
spectrum and fundamental frequency processing in young autistic children 
who spoke Hebrew. They assessed long-term average spectrum and fun- 
damental frequency variability in 60-second speech samples of 41 children 
with autism using a picture-naming task. Compared to the control group, 
the spectra were shallower and there was less harmonic structure in the 
group with autism. These results imply abnormal processing of auditory 
feedback or elevated noise and instability in the mechanisms that control 
phonation. All of which could have a significant impact on the child’s 
ability to tune into speech. Finally, Lyakso, Frolova and Grigev (2016) 
assessed acoustic features of speech such as fundamental frequency (f0), 
f0 range, formants, frequency and duration in emotional speech, sponta- 
neous speech and repetitions of words in 60 Russian-speaking children 
with autism. Similar to previous studies in tonal languages and Hebrew, 
abnormal prosody was a consistent feature. All children with autism had 
high values of fundamental frequency, abnormal spectrum and well- 
marked high frequency. Stressed vowels also had higher values of fun- 
damental frequency. Results indicated speech abnormalities in autism is 
reflected in their spectral content and fundamental frequency variability. 
Understanding and producing appropriate stress patterns appears to be 
difficult for people with autism. 

The second perspective for increased prevalence of SSEs in autism is 
that a subtle, but significant, motor control impairment in autism causes 
differences in speech production (Adams, 1998; Barbeau, Meilleur, Zeffiro, 
& Mottron, 2015; Belmonte et al., 2013). This perspective is becoming 
increasingly attractive as evidence accumulates that motor disruptions in 
other domains, such as in the purposeful movement of the arms (Crippa 
et al., 2015; Torres et al., 2013), legs and posture movements in gait 
(Nayate et al., 2012; Rinehart et al., 2006). Additionally fine motor control 
during writing and object manipulation (Fuentes, Mostofsky, & Bastian, 
2009) are disrupted in children with autism. A recent meta-analysis of 
motor data in autism suggest motor disruption may be a core feature of 
autism and not merely a co-morbid or associated condition (Fournier, 
Hass, Naik, Lodha, & Cauraugh, 2010). 
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Motor impairment is evident at both gross and fine level in autism. 
There may be a fundamental underlying problem with motor timing 
and integration required to produce the correct, efficient kinematic 
patterns required of skilled movements, including speech (Beversdorf 
et al., 2001; Gowen & Hamilton, 2013; MacNeil & Mostofsky, 2012; 
Mostofsky, Powell, Simmonds, & Goldberg, 2009; Whyatt & Craig, 
2013). Such disruption to movement early in a child’s development is 
thought to contribute to the broad autism phenotype, disrupting expres- 
sive intention and purposeful engagement with others, causing frustra- 
tion, distress and isolation (Trevarthen and Delafield-Butt, 2013). In 
verbal expression, articulating fluently requires intricate control and 
coordination of speech motor mechanisms (Gracco, 1994). Therefore, 
this perspective proposes the increased rate of SSEs present in children 
with autism may be a result of common, underlying motor difficul- 
ties. Indeed, the residual articulation errors reported by Shriberg et al. 
(2001) affect the late acquired and articulatory complex speech sounds 
such as sibilants and rhotics; sounds that require intricate speech motor 
skills. 

Evidence of motor impairment in autism is growing. Neuroanatomical 
correlates have been proposed for the observed difficulties in motor func- 
tioning including abnormalities in the cerebellum (Fatemi et al., 2012), 
disruption in brain synchronization (Welsh, Ahn, & Placantonakis, 
2005), impaired sensory input and multisensory integration (Gowen & 
Hamilton, 2013). As a result, it is suggested that if general motor abili- 
ties are impaired, this could result in a speech motor control impairment 
(Barbeau et al., 2015). Adams (1998) examined oral-motor and motor- 
speech production of four young children with autism compared to TD 
children in both simple and complex phonemic production. Data indicated 
that children with autism had significantly more difficulty performing oral 
movements and complex syllable production tasks compared to TD chil- 
dren. These results could indicate a speech motor impairment. However, 
due to their small sample size, these results are not generalizable. More 
research of speech motor control compared to general motor abilities is 
crucial to understanding the case of SSEs. 

The connection between speech motor control and general motor abil- 
ities has been examined in the TD population. Nip, Green and Marx 
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(2011) found that TD infants showed a correlation between changes 
in articulatory movements and development of early communication. 
Using a motion capture system every three months, the movements of 
the upper lip, lower lip and jaw were recorded from 24 children (between 
the ages of 9-21 months). Children who had reduced speech motor con- 
trol had a delayed trajectory of communication development. Significant 
associations were identified between orofacial kinematics and the stan- 
dardized measures of language and cognitive skills, even when age served 
as a covariate. This initial evidence suggests interactions between cog- 
nition, language and speech motor skills during early communication 
development. Further research is required to identify and quantify causal 
relations among these co-emerging skills and whether this extends to 
general motor ability. Alcock (2006) also found that motor control was 
associated with an existing language impairment, particularly oral motor 
control. Moreover, Lewis et al. (2011) found children with SSD were 
slower to complete diadochokinesis tasks and had differences in their 
oral motor control compared to TD children. These studies provide evi- 
dence of an inherent link between speech and general motor capabilities, 
though whether there is a causative mechanism (in either direction) is 
unknown. 

Little research explores the relationship between speech and general 
motor impairment in children with autism. Nevertheless, there have been 
some interesting findings in children with idiopathic speech disorder who 
have been found to have reduced performance on tasks that involve visual 
motor control and fine motor control, e.g. grasping and object manipu- 
lation (Newmeyer et al., 2007). Peter and Stoel-Gammon (2008) found 
children with SSEs had deficits in repetitive finger tapping and clapping 
exercises associated with fine motor control. Lewis et al. (2011) found 
children with SSE were slower to complete diadochokinesis tasks and 
a maximum phonation task associated with the competency of speech 
motor control, compared to TD children. Bradford and Dodd (1994) com- 
pared ten phonologically delayed children, ten children with consistent 
phonological disorder and ten children with inconsistent error patterns. 
Groups did not differ on simple motor tasks; however, the group with 
inconsistent error patterns performed significantly worse in timed motor 
planning tasks and expressive novel-work learning tasks than in the other 
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two groups. These results provide support for the perspective that incon- 
sistent error patterns are associated with a deficit in some aspects of fine 
motor planning, a similar pattern that has also been identified in autism 
(Fournier et al., 2010). 

Timing is a fundamental aspect of speech production. Fluent speech 
requires information to be selected, sequenced and articulated in an accu- 
rate and time sensitive manner. A set of quasi-autonomous articulatory 
systems need to work in coordination (Kotz & Schwartze, 2016; Maassen 
& Van Lieshout, 2010). Whilst little research has been carried out on 
speech timing, studies suggest there may be abnormalities in sensorimotor 
timing in children with autism. Anzulewics, Sobota and Delafield-Butt 
(2016) found an increase in the speed of fast taps and swipes in chil- 
dren with autism playing an iPad game. Torres, et al. (2013) found an 
increase in the acceleration-deceleration phases of a reach-to-touch task 
in children with autism. These tasks demonstrate a subtle, but signifi- 
cant disruption to moment-by-moment control of movement occurring 
in the region of 30-70 ms, a temporal domain important for speech. 
Over- and under-compensations of such rapid shifts in force are thought 
to underpin the overt motor disruptions typically observed (Trevarthen 
& Delafield-Butt, 2013; Whyatt & Craig, 2013). These compensations 
may affect basic perception and effect experience resulting in disrupted 
speech development due to lack of coordination of articulatory systems 
(Colwyn Trevarthen & Delafield-Butt, 2017). Cook, Blakemore and 
Press (2013) found sub-second control of velocity and acceleration was 
affected in individuals with autism in simple arm-swing tasks. This study 
indicated that fast timing at less than a second (sub-second) required of 
speech motor control might be disrupted in limb and hand movements in 
individuals with autism. 

Future research needs to look at both linguistic and motor planning 
skills in children with autism to describe accurately the range of cogni- 
tive processes that may be affecting their speech production. These studies 
above indicate there may be a deficit in motor planning and programming 
associated with speech sound disorders, but the origin is still unknown 
(Shriberg & Kwiatkowski, 1994). Therefore, it is important we look at the 
studies on SSEs in autism in detail to determine what knowledge exists on 
their nature and causes. 
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8. Conclusion and future directions 


Researchers have identified atypical speech development in children with 
autism (Shriberg et al. 2001; Cleland et al. 2010; Wolk & Brennan 2013). 
However, it has been argued these errors are within a sequence of normal 
development (delayed) rather than atypical (Kjelgaard & Tager-Flusberg 
2001; McCleery et al. 2006). Inconsistent outcomes in the literature 
may be a result of inconsistent and reduced specificity of the perceptual 
measurements used across studies. This could also reflect the heterogeneity 
in the population of people with autism in their production of speech. 
If deficits in speech motor control mirror the deficits in fine motor con- 
trol, then finer-grained techniques may be needed to identify them. This is 
important, because even if speech motor control problems are subtle, their 
existence might indicate that an underlying motor impairment is at the 
heart of autism. It is unlikely that subtle speech motor control problems 
will be identified with judgments on the correctness of productions of single 
words. Instead, one needs speech tasks such as maximum-performance 
tasks that tax the motor system. Alternatively, it is possible that articu- 
latory analysis will identify qualitative differences in the articulations of 
children with autism compared to typical speakers. Indeed, articulatory 
analysis, namely ultrasound tongue imaging, has been used in one study to 
assess and treat abnormal articulations in children with autism. Cleland, 
Scobbie, Heyde, Roxburgh and Wrench (2019) (found that ultrasound 
visual feedback might facilitate speech sound learning. While the study 
was not focused exclusively on children with autism, three of the chil- 
dren presented with SSEs and autism within the sample and responded 
to intervention. Although ultrasound tongue imaging is at early stages of 
development for assessment and intervention, it is a promising method of 
analysing SSEs in the depth required to identify subtle articulation errors 
in children with autism. 

In conclusion, it is vital to determine why SSEs may be occurring and 
whether such occurrences are a result of disruption to speech attunement, 
a disruption to speech motor issues, or, more likely, both. Each aspect 
of speech development and production could affect the other, the two 
are entwined within the life of the child. Children with autism appear to 
have less drive to attune to the speech of peers due to particular social 
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impairments. This may result in a reduction of motivation to produce 
speech that is intelligible and functional for others to comprehend. This 
may explain why we see prosodic abnormalities and unusual distortions 
errors, which do not affect intelligibility, such as phoneme-specific nasal 
emission (Cleland et al., 2010) and difficulties with articulatory com- 
plex speech sounds (Shriberg et al., 2011). Conversely, the disruption in 
speech motor performance that thwarts its intended meaning for others 
can itself drive a reduction in motivation to attune, leading to the same set 
of autistic consequences. Either way, improved aetiological understanding 
will help to determine principal underlying capacities and therefore routes 
to more effective intervention. Current research does not provide a clear 
picture of what theory best applies — if indeed either theory is appropriate 
without consideration of the other. In addition, it may be that there are 
subgroups of children with SSEs within the broad autism spectrum. Such 
ideas require testing. Future research also needs to look equally at both lin- 
guistic and motor skills in children with autism to describe accurately the 
range of mental and neurophysiological process that may be affecting the 
production of speech. Understanding these questions will help to improve 
effective speech therapy interventions to target the underlying disruptions 
that give rise to SSEs at an early age and develop bio-markers for earlier 
diagnosis of autism. 
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Joanne Cleland and James M. Scobbie 


Acquisition of new speech motor plans via 
articulatory visual biofeedback 


Abstract: This chapter describes the concept of categorising persistent Speech Sound 
Disorder in children as a disorder characterised by erroneous motor plans. Different 
types of articulatory visual biofeedback are described, each of which is designed to 
allow children to view their articulators moving in real time and to use this infor- 
mation to establish more accurate motor plans (namely, electropalatography, elec- 
tromagnetic articulography and ultrasound tongue imaging). An account of how 
these articulatory biofeedback techniques might lead to acquisition of new motor 
plans is given, followed by a case study of a child with persistent velar fronting who 
acquired a new motor plan for velar stops using ultrasound visual biofeedback. 


Keywords: visual feedback, articulation, Speech Sound Disorders, 
electropalatography, ultrasound, electromagnetic articulography 


1. Introduction 


Children with Speech Sound Disorders (SSD) have difficulty acquiring the 
speech sounds of their native language in the course of normal develop- 
ment; producing certain sounds incorrectly, substituting them with other 
sounds or omitting them altogether. SSDs are the most common type of 
communication impairment; around 11.5 % of eight-year olds (Wren, 
Miller, Emond, & Roulstone, 2016) have SSDs ranging from common 
distortions such as lisps and /r/ distortions to speech that is unintelligible 
even to close family members. 

For many children, the cause of their SSD is unknown (though SSDs 
are also associated with a range of conditions including hearing impair- 
ment and cleft palate) and is usually thought to arise from a difficulty 
acquiring the phonology of their ambient language. Indeed, most children 
with SSDs have “phonological” impairments (87.5 % in an analysis of 
caseload referrals by Broomfield & Dodd, 2004). It appears that a lesser 
number (12.5 % of caseload) have “articulation disorders”, in that they 
more clearly have a problem producing certain (normally late-acquired) 
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speech sounds. Overall, the problem is thought to be mainly cognitive, so 
that children have difficulty learning the patterns of their language which 
often leads them to display the simplification processes representative of 
an earlier age in typical development, for example by reducing clusters or 
replacing velars with alveolars, resulting in phonological merger. 

In therapy, the resulting homophony motivates remediation in part 
by confronting children with their inability to signal contrast. There is 
good evidence that in young children these auditory-based phonological 
interventions, for example minimal pairs intervention (Law, Garrett & 
Nye, 2003) are very effective. However, in around half of children with 
SSDs the problem persists into the school years, and a smaller number still 
become “intractable”, beyond the age of eight. There is growing evidence 
that these children may not have a purely cognitive phonological disorder, 
but display (also) subtle motor problems. For example, Wren et al. (2016) 
found that weak sucking at six weeks of age is a risk factor for SSD at eight 
years of age. These types of potentially motoric speech impairments need 
interventions that capitalise on the principles of motor learning (see Maas 
et al., 2008 for a tutorial). Children with ingrained incorrect motor plans 
(for example, children who persistently misarticulate certain phonemes) 
need motor-based techniques for teaching and practicing new articulatory 
gestures. 

In the motor-learning literature, the ontogeny of complex movements is 
studied by looking at an individual’s ability to imitate a novel movement 
(Paulus, 2014). This is problematic for children who haven’t acquired 
articulatory gestures via the normal auditory route because the main artic- 
ulator, the tongue, is largely hidden from view. Researchers and clinicians 
have therefore sought to circumvent this problem by augmenting the 
acoustic (and tactile) information already available to the speaker through 
the use of instrumental imaging technologies conveying aspects of vocal 
tract articulation directly to the speaker, that is, by providing biofeedback. 


2. Articulatory feedback approaches 


In phonetics the use of instrumental techniques to measure movement of 
the articulators has a longer history than of sound recordings being used 
to measure acoustics, beginning with static palatography in the late 18th 
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century through to cine-Magnetic Resonance Imaging (MRI) in recent 
years. Techniques like electropalatography (EPG) and electromagnetic 
articulography (EMA) are well established, with ultrasound and MRI 
gaining popularity thanks to methodological improvements and falling 
costs. All of these techniques give researchers data that can be used to 
create visual images of otherwise invisible articulators, especially the 
tongue. However, only a small number allow data to be visualised in real 
time in a way that is immediately meaningful to the viewer, namely EPG, 
EMA and Ultrasound Tongue Imaging (UTI). Since the 1980s (Dagenais, 
1995) the potential for using visualisations of the articulators as a pow- 
erful speech therapy tool has been explored. Most of the research to date 
has focussed on EPG, with a large number of “small n” studies showing its 
potential as a visual biofeedback (VBF) device (Gibbon, 2013). 

EPG is a technique for displaying the timing and location of tongue- 
palate contact (Hardcastle & Gibbon, 1997). The speaker sees an abstract 
representation (Figure 1) of linguo-palatal contact, which is very useful for 
conveying aspects of coronal (and dorsal) consonants (and some vowels) 
in real time, and is encouraged to use this to modify their own erroneous 
articulations. It is worth noting that the display in EPG is normalised. All 
speakers see the same display irrespective of the size and shape of their 
hard palate. This potentially makes the display easier for the Speech and 
Language Therapist (SLT) to interpret. Additionally, the anterior third of 
the EPG palate is displayed in the anterior half of the normalised com- 
puter display. This is because the tongue-tip (the part most often in con- 
tact with the anterior part of the palate) contains more nerve endings and 
achieves more fine-grained articulation. While the % to 1⁄2 ratio is arbi- 
trary, the understanding of this visual display is thought to be relatively 
intuitive (Gibbon & Wood 2010), even for those with cognitive impair- 
ment (Cleland et al. 2009). 

While EPG shows tongue-palate contact rather than visualising the 
articulators directly, EMA shows the movements of a small number of 
specific flesh-points. Sensors are directly attached (glued) to articulators 
such as the jaw, lips, and (crucially) the tongue, and can be visualised in 
real time on a computer screen (Figure 1). While EPG shows 62 points of 
contact on the hard palate, EMA normally tracks a much more limited 
number of points: usually three sensors attached near to the midsagittal 
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Figure 1: Instrumental articulatory technique displays (not recorded 
simultaneously). From left to right: MRI-derived animation (produced with 
permission from Eleanor Lawson), electropalatography, Ultrasound, Opti-Speech 
(electromagnetic articulography). 


tongue tip, then two more on the front of the tongue, about 1.5cm and 
3cm posterior (Katz & Mehta, 2015) which is about as far into the ante- 
rior oral cavity as can be reached easily. More recent systems, for example 
the Wave Electromagnetic Speech Research System (NDI, Waterloo, ON) 
allow three-dimensional tracking of five small sensors affixed to the client’s 
tongue. Software such as “Opti-Speech” (Vick, Mental, Carey, & Lee, 
2017) shows the sensors in the context of an avatar (see Figure 1). 

EMA has been popular in articulatory phonetics studies because it 
is one of the few techniques which allows velocity and acceleration of 
movements to be calculated and interpreted easily, because of the flesh 
point tracking. However, it is not likely that speakers control speech pro- 
duction in terms of a small number of such points, nor that in experimental 
studies the most meaningful points are selected, nor studied in a replicable 
manner. In terms of biofeedback, EMA has not been particularly popular: 
the equipment is expensive, positioning the sensors on the articulators 
requires training, and it is potentially invasive, especially for children. 
However, a small number of studies have shown it to be potentially useful 
for VBE. Katz and Mehta (2015) evaluated the technique for teaching na- 
tive speakers of American English to produce the non-English segment [q]. 
In this study, the Opti-Speech system was used to display the EMA sensors 
superimposed on an animated avatar showing the tongue in a mid-sagittal 
head context. Target areas for the sensors were also shown, and on-target 
articulations were highlighted by changing the sensor colour from red to 
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green. Results indicated a rapid gain in accuracy associated with visual 
feedback training. However, extrapolating from these results into the clin- 
ical domain should be interpreted with caution for three reasons: firstly, 
the speakers did not have SSDs; secondly, the speakers were not asked to 
integrate the new articulation into words; and lastly a similar experiment 
by Cleland, Scobbie, Nakai, and Wrench (2015) using ultrasound showed 
that retroflexes were just as easy to teach to English-speaking children 
using auditory methods as they were with VBF. 

To date, just one study has used the Opti-Speech (EMA) system to treat 
residual speech errors in children and young people. Vick et al., (2017) 
treated residual /s/ (two children) and /r/ (two children) distortions. Early 
results showed that it is possible to use the technique to remediate these 
errors, and that generalisation can occur. However, further research is 
needed to determine the effectiveness of EMA for treatment of SSDs and 
also to determine whether clinicians in the field find this technique useable 
in the practical sense. 

In contrast to these studies which use direct EMA displays of the real- 
time movements of sensors, more recent research has sought to gamify the 
articulatory information, again in (near) real time. Yunusova et al. (2017) 
used a single tongue tip sensor to drive a computer game in which the 
object was for a dragon character to breathe as much fire as possible. 
The size of the dragon’s flames was directly related to the size of the 
speaker’s articulatory working space (AWS). In this case, the augmented 
VBF was designed with a very specific population in mind: speakers with 
Parkinson’s disease. This particular neurodegenerative condition causes a 
reduction in articulatory movements (causing dysarthric symptoms such 
as undershoot) and leads to reduced intelligibility. By providing a meta- 
phor (the fire-breathing dragon) which visually produces more fire in cor- 
relation with increasing AWR, speakers with Parkinson’s disease were able 
to use the feedback to increase their intelligibility. Increasing the strength 
and range of movements which already follow the correct articulatory tra- 
jectory is, however, quite different from establishing a correct gesture in 
replace of an erroneous one (for example, a central fricative produced 
laterally), or an absent one (for example, in someone who has no velars in 
their phonetic inventory). Therefore, any gamification of VBF designed for 
establishing new articulations is likely to need games which relate more 
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directly to the trajectory of a specific segmental gesture rather than to the 
global magnitude of change during the production of a word. 

In contrast to EPG and EMA, which show a discrete number of 
points, U-VBF shows an anatomically accurate speaker-specific repre- 
sentation of the tongue. With this technique most of the surface of the 
tongue is visible in a mid-sagittal view (Figure 1), and interpretation of 
the images is thought to be relatively intuitive (Bernhardt et al. 2005). In 
contrast to EPG, the image is an anatomically correct representation of 
part of the tongue, however, other important anatomical information, 
such as the relation of the tongue to the hard palate, is not normally 
visible (Cleland et al., 2019). Moreover, this “raw” ultrasound suffers 
from artefacts, and the tip of the tongue is often in shadow from the 
mandible. However, ultrasound has practical advantages over EPG and 
EMA in that it does not require expensive individual artificial palates 
or expensive sensors. Moreover, since it involves no intra-oral equip- 
ment it is less physically invasive, potentially making it more suitable 
for children. 

Given the practical limitations of EMA most of the clinical studies in 
the literature have used EPG and, more recently Ultrasound-VBE. Indeed, 
U-VBF is rapidly gaining popularity, probably because of its lower cost 
and because more portable high-speed ultrasound systems are now avail- 
able. To date, 29 small studies have been published in the literature inves- 
tigating the efficacy of U-VBF (see Sugden, Lloyd, Lam and Cleland, 2019 
for a systematic review). Of these studies, 27 were published in the last 
10 years and 17 in the last three. While larger clinical trials of both EPG 
and UTI are needed in the future, it is essential to know theoretically why 
and how these techniques work because identifying the agents of change 
(the “active ingredients”) in an intervention is essential for refining the 
intervention and establishing dosage. 

None of these instrumental techniques are therapies in their own 
right (Bacsfalvi et al. 2007); most SLTs use them to supplement tra- 
ditional techniques, such as articulation therapy (Van Riper & Emerick, 
1984) or motor-based intervention (Preston et al., 2013). One key ingre- 
dient of articulatory VBF is that it can be used to demonstrate complex 
articulations that are normally difficult to describe. Describing articulatory 
movements is an essential part of traditional articulation therapy (Van 
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Riper and Emerick, 1984). Normally this is done with verbal descriptions, 
or perhaps diagrams, ranging from impromptu sketches to computer 
animations. 

It is crucial, moreover, to unpick the visual model aspect of EPG/UTI 
from the biofeedback aspect. That is, we need to know the extent to 
which a speaker benefits from informative general visual models of artic- 
ulation, and the extent to which real-time biofeedback of the learner’s 
own tongue during speech production provides crucial additional 
information. 

Considering first the model aspect on its own, studies which inves- 
tigate the use of an articulatory model to teach new speech sounds are 
few. Massaro et al. (2008) used a “Talking Head” to teach native English 
speakers a new vowel [y] and consonant [q]. Talking Heads are artificial 
animations of speech usually based ultimately on instrumental (e.g. MRI 
or EMA) data. Some are 3D (e.g. Badin & Serrurier, 2006) and some are 
2D (e.g. Kröger et al., 2013), but most attempt to model the movement of 
the tongue during speech with a cut-away profile or mid-sagittal view of 
the tongue. 

The main application of Talking Heads is usually as a teaching tool 
for pronunciation training in second language learning (Cleland et al., 
2013). However, there is little evidence that this is effective. In the Massaro 
et al. study (2008) a view of the lips was useful for teaching the high- 
front rounded vowel [y] but a mid-sagittal Talking Head did not improve 
learning of the distinction between [k] and the uvular stop [q]. There is a 
confound here, however, due to one study involving a segment where lip- 
rounding is the defining feature and one where it is uvular place: lip reading 
is not only a natural phenomenon but one known to improve perception 
of speech (see below). Similarly, a study by Fagel and Madany (2008) 
which used a Talking Head to teach [s] and [z] to children with interdental 
lisps failed to show an effect. Thus, a visual model alone appears not to be 
the essential ingredient for success. However, since the above studies did 
not give the learners any information about closeness to target (e.g. from a 
human judge or automatic speech recognition), and since articulatory con- 
striction is a key feature of production, further study is required to directly 
compare an articulatory model against VBF using the same type of display 
and mediation. 
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3. Theoretical explanations for the role of 
biofeedback in learning new articulations 


Children who make inappropriate phonetic realisations of certain speech 
sounds do so because they have an inappropriate motor plan for that sound 
(Preston et al., 2014; Cleland et al., 2019). Cleland, Scobbie and Wrench 
(2015) suggest that these erroneous motor plans can be ascribed to one of 
three categories: 1. It is identical to that of another phoneme, resulting in 
perceived homophony (as in canonical velar fronting); 2. the motor plan 
is abnormal or underspecified resulting in something which is perceived 
as homophonous but is subtly different in some way (as in covert con- 
trast, Gibbon & Scobbie, 1997), for example /t/=[t] and /k/=[t]) or; 3. the 
motor plan is abnormal to the extent that it results in the realisation of an 
obviously non-native speech sound, for example a lateral lisp in English- 
speaking children. It is possible that different types of VBF are needed to 
overcome each of these erroneous motor plans. In the case of category 1, 
normally a phonological cause would be ascribed, however Cleland et al. 
(2017) present several cases of children with persistent velar fronting with 
identical tongue-shapes for /t/ and /k/ but awareness of the error and (ini- 
tially) an inability to produce a velar articulation of any type. In these, 
and other cases, the inability to produce the correct articulatory gesture 
upon imitation is often coupled with a lack of understanding (despite pre- 
vious intervention) of how the gesture is achieved at all, with one of the 
children in the Cleland, Scobbie and Wrench (2015) study stating that she 
thought producing a velar was “impossible” the first time she viewed an 
ultrasound movie of that segment, highlighting the lack of understanding 
she had as to the movements required to achieve a velar despite previous 
therapy targeting this very sound (Cleland et al., 2019). 

In addition to a lack of explicit understanding about the movements 
required to achieve a particular sound, there may be some implicit 
learning involved in the viewing of tongue movements. In typical audio- 
visual speech perception, viewing the speaker’s lips enhances perception, 
particularly in noise (Benoit & Le Goff, 1998). Typical speakers integrate 
lip information into their perceptual system, as shown by the McGurk 
effect (McGurk & MacDonald, 1976). Clearly whilst lips are easily vis- 
ible during interactions, the tongue is not. Even so, Badin, Tarabalka, 
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Elisei, and Bailly (2010) suggest that it is possible to “tongue-read” in 
the same way as it is possible to lip-read. That is, viewing a Talking 
Head of tongue movements leads to better discrimination of speech in 
noise and potentially could be used for learning new articulations. Badin 
et al. (2010) hypothesise that this is due to a natural, intuitive ability 
for listeners/viewers to tongue-read, suggesting that this provides sup- 
port for a perception/production link which could relate to the theory 
of mirror neurons (Cleland et al., 2019). Mirror neurons are thought to 
underlie the imitation system, because they are neurons that fire when a 
person both sees an action being performed (or hears it being performed, 
in which case they may be called echo neurons) and performs that ac- 
tion themselves. So, in theory, when a person hears a speech sound, the 
neurons in the motor area required for articulating that speech sound fire. 
In fact, even passive listening to speech sounds evokes a pattern of motor 
synergies mirroring those occurring during speech production (D’Ausilio, 
Bartoli, Maffongelli, Berry & Fadiga 2014). There is emerging evidence 
that this does not just apply to hearing a speech sound, but also to seeing 
it. Treille, Vilain, Hueber, Schwartz, Lamalle and Sato (2014) showed 
activation in the premotor and somatosensory cortices when observing 
lingual movements from ultrasound, suggesting that demonstration of 
correct articulatory movements may be a crucial aspect of visual bio- 
feedback. Moreover, using delayed U-VBF might evoke the same process. 
In this type of feedback, the child (as well as watching the live visual 
biofeedback) watches their own production replayed after a delay (once 
they have finished speaking, not to be confused with delayed auditory 
feedback, which has very short delay times). The SLT then encourages 
the child to reflect on the correctness of their production. While viewing 
their own incorrect production could potentially have an adverse effect, 
viewing their own correct production gives a speaker-specific representa- 
tion of the required articulatory gesture. 

Whilst it would be unethical and ethically dubious to compare U-VBF 
without demonstration to U-VBF with it, it would be feasible to conduct 
a randomised control trial where one arm of the trial involved the use 
of an ultrasound-based visual articulatory model, without biofeedback 
(Cleland et al., 2019). Indeed, a small study of speakers with cleft palate 
(Roxburgh, 2018) found that the children did just as well with a visual 
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articulatory model to learn new articulations as they subsequently did with 
U-VBE. However, this study was limited by a small sample size of just two 
participants, and that neither had had previous therapy to address the rele- 
vant speech problem (i.e. they were not ‘intractable’, Cleland et al., 2019). 

The question remains as to how VBE, or indeed a visual model alone, 
could lead to acquisition of new articulations, especially when, in the case 
of intractable SSDs, the speakers have been exposed to extensive models of 
the correct articulation from other speakers, albeit only in auditory form. 
It seems in this case that the auditory imitation system has failed somehow, 
perhaps enabling the visual modality to offer useful new information. 
Indeed, evidence exists that the observation of completely novel behav- 
iour (in this case a previously unseen articulatory movement) generates 
mirroring activity in the premotor cortex (Cross, Hamilton and Grafton, 
p. 11, 2006). Moreover, Mattar and Gribble (2005) show that complex 
motor behaviours, which speech undoubtedly is, are greatly assisted by 
first observing another engage in the activity. Via this mechanism, models 
of the new activity are formed in the premotor cortex via the mirror 
neurons and presumably intensity of neuronal firing increases with prac- 
tice/exposure. It is not enough to simply watch the new movement repeat- 
edly and expect acquisition of a new motor plan: practice is required by 
the speaker. (Imagine trying to learn the piano only by watching videos of a 
pianist’s fingers!) Del Giudice, Manera and Keysers (2009, p. 352) explain 
the mechanisms by which practice of movements leads to acquisition, by 
looking at grasping: “activity in the premotor cortex leads to a grasping 
movement. The movement is seen by the acting individual, causing activity 
in neurons in the temporal cortex. This activity is sent to the parietal and 
premotor cortex, where it finds neurons that are active because the subject 
is currently performing the action. This leads to Hebbian enhancement of 
the congruent connections from temporal to parietal and from parietal to 
premotor neurons representing the same action; incongruent connections 
do not undergo such enhancement”. It is therefore conceivable that seeing 
a novel speech motor movement leads to development, or otogeny, of the 
mirror neuron whilst actually doing the novel tongue movement yourself 
leads to Hebbian enhancement, which in turn is enhanced by lingual visual 
biofeedback. Repeated association of the sound (knowledge of results) 
with the movement (knowledge of performance) leads to enhancement in 


Acquisition of new speech motor plans 149 


acquisition of the new skill. Of course, this ought to be entirely possible 
with only the articulatory model, provided the speaker is able to practice 
accurately, and biofeedback may not be required. However, it is likely 
that some individuals are unable to make the leap between seeing the new 
articulation and beginning to practice it themselves, that is, no matter 
how many times they see it they cannot perform it, or even approximate 
a performance of it. In this case the speech and language therapist too 
benefits from the visual feedback as s/he is able to use shaping techniques 
(Bleile, 2004) to explicitly demonstrate to the speaker that similar motor 
programmes are already within their grasp. 

Evidence for the biofeedback aspect of U-VBF comes from experiments 
on experiential canalised learning. Canalisation is the means by which a 
developmental process is buffered against perturbations. It ensures that 
important features of the organism emerge reliably despite great varia- 
tion between individuals in environmental conditions and genotypic 
makeup. The classic example is that of ducklings raised in incubators 
which still spontaneously exhibit the ‘correct’ preference for their own 
species’ maternal calls, despite never hearing a mother duck. However, if 
the ducklings are prevented from hearing their own vocalizations, they fail 
to exhibit selective responses to maternal calls (Gottlieb, 1991) suggesting 
a key factor is self-produced vocalizations. That is, the speaker must make 
the articulatory movements themselves and evaluate the acoustic output 
in order to acquire them. Visual biofeedback offers a new modality for 
learners who have failed to acquire speech sounds via the normal routes. 
Moreover, in live bio-feedback the speaker is able to bootstrap the new 
visual modality not only onto the auditory modality but also onto the 
haptic modality to make small adjustments to their articulatory gestures 
in real time. In the speech therapy clinic this is evidenced by articulatory 
groping towards the target in the early stages of intervention. 

In sum, U-VBF works by first showing the learner what is to them a 
novel movement, then performance of the new movement leads to Hebbian 
learning, which is boosted by the visual knowledge of performance pro- 
vided by U-VBẸ this leads to increasing activation of the mirror neuron, 
laying down of a new general motor programme and hence eventually mas- 
tery of the new sound. If the mastery of the new sound is a gradual process 
then we might expect to detect various types of phonetic gradience in the 


150 Cleland and Scobbie 


short-term longitudinal change, potentially in addition to rapid categorical 
change. Some evidence of incomplete generalisation of a new articulation is 
shown in U-VBF studies where post-intervention scores for target segments 
are lower than 100 % correct. For example, Cleland, Scobbie, Roxburgh, 
Heyde and Wrench (2019) show that after intervention children with a 
wide variety of lingual errors show improvements in accuracy of targeted 
gestures, but no child achieved perfect percent target consonants correct in 
all phonotactic contexts. However, the approach of categorising segments 
within words as correct or incorrect obscures the potential subtlety of the 
process. More important for understanding the pathway to acquisition is 
the fine detail necessary for a full evaluation of new articulations produced 
by children as the result of clinical intervention. 

For example, consider the two children reported by Cleland et al. 
(2019) who made progress towards the target, changing posterior (pha- 
ryngeal fricatives for sibilants) to anterior articulation, but with incor- 
rect lateral airflow. For these children, the updated motor plan is more 
accurate, since in it contains more of the correct features of the target, 
even though the output is still wrong linguistically. The motor plan has 
therefore changed in a gradient manner, as both children also show pro- 
gress towards achieving the correct airflow. However, gradient acquisition 
of targets may manifest differently in each of the three erroneous motor 
plans 1. Motor plans identical to another sound; 2. Motor plans which are 
covertly different but perceived as a different sound and 3. Motor plans 
which result in a non-native sounding phone. Type one is particularly 
interesting, because in a traditional model these children would be said 
to have classic substitution errors, thought to be phonological in nature. 
If this were the case, we would not expect these children to acquire a new 
articulation in a phonetically gradient manner (though they may acquire 
it in some phonotactic conditions before others as is the case in typical 
acquisition of a segment). 

What follows is a case study of a child who presented with a classic 
substitution error who nevertheless shows gradient change during reme- 
diation. Rather than presenting only binary information on the correct- 
ness of her new articulations, which would obscure more subtle changes, 
we explore the process in more articulatory detail during the therapeutic 
process. 
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4. An illustration of gradient acquisition 
of a new articulation 


While typically developing children are usually able to produce velars cor- 
rectly by the age of three and a half years (Dodd, 2013), those with SSDs 
may not be able to produce velars till much later. A lack of velars in a 
child’s phonetic inventory has been recognised as a prognostic indicator 
for a phonological disorder (Grunwell, 1987). Children who persistently 
fail to differentiate coronal and dorsal articulations may therefore have an 
underlying motoric deficit. Gibbon (1999) suggests that this may manifest 
as an “Undifferentiated Lingual Gesture” (ULG), where the tongue moves 
as a whole, rather than, as expected, by executing gestures using indepen- 
dent parts. Children with UGs show abnormally extensive tongue-palate 
contact patterns in EPG studies (Gibbon 1999) and (in just one study to 
date) abnormal dorsal raising in ultrasound (Cleland et al., 2017). This 
error pattern is motoric, rather than phonological. 

While there are studies showing these abnormal articulations, there are 
no studies showing how articulations change as children initiate a cor- 
onal/dorsal differentiation or achieve mastery of it. In several of our pre- 
vious studies (Cleland et al., 2015b, 2017, 2019) we reported on children 
who persistently front velars to alveolars, despite being over six years of 
age. Velar fronting is readily remediated using U-VBE, with some chil- 
dren showing a categorical shift from 0 % velars correct pre-therapy to 
100 % post-therapy. Speaker “07F_Ultrax” is reported in Cleland et al. 
2015 and 2017. At the time of the U-VBF intervention she was aged 7;6 
and presented with velar fronting in the absence of a history of any other 
errors. Pre-intervention, she produced no correct velars, half-way through 
intervention she was not perceived to produce any correct velars, but 6 
weeks later, at the end of the intervention period, she produced 100 % 
correct velars in a word list designed to probe this segment in multiple 
phonotactic positions. She maintained that gain three months later. Prior 
to intervention she produced both /t/ and /k/ with identical tongue shapes, 
in other words, a classic merger (see Cleland et al., 2017) appears to have 
been almost instantly fixed. We turn our attention now to an ultrasound 
analysis of 07F’s productions of alveolars and velars at various time-points 
in the intervention process. 
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07F_Ultrax was recorded with simultaneous high-speed ultrasound 
and audio. The ultrasound was probe-stabilised with a headset (Scobbie, 
Wrench & Van der Linden, 2008) to allow us to compare tongue shape for 
/t/ and /k/ directly. Materials were a wordlist containing velars in a wide 
range of vowel environments and word positions. 

Using AAA v2.16 software (Articulate Instruments, 2012) /t/ and /k/ 
segments were annotated at the beginning of the burst, the nearest ultra- 
sound frame was then selected and a spline indicating the tongue sur- 
face fitted to the image using the semi-automatic edge-detection function 
in AAA software. Splines were then averaged by target segment and 
compared. 

In this case, we are interested in the degree of separation between /t/ and 
/k/. If O7F presents with merged productions of /t/ and /k/, then we would 
expect to see no degree of separation between /t/ and /k/ and if she presents 
with ULGs for both, then we might expect a reduced degree of separation 
between /t/ and /k/ compared to typically developing children. The differ- 
ence between /t/ and /k/ can be characterised as maximum radial dorsal 
difference between these two segments (Figure 2). 

Scobbie and Cleland (2017) report the average maximum width of the 
radial difference between /t/ and /k/ at mid-closure for 30 typically devel- 
oping children as 11.9mm, 7.5mm and 12.1mm for symmetrical /a/, /i/ and 
/o/ contexts respectively. 

By applying the same measurements (Figure 3) to all the time-points 
from 07F’s data, we can quantify the gradient increase in the degree of 
separation between /t/ and /k/ at each time point (Figure 4). What is inter- 
esting, is that by looking only at percent target consonants correct, 07F 
appears to make a categorical shift from 0 % to 100 % correct between 
mid-therapy and post-therapy sessions, whereas in fact she was already 
beginning to change the production by the mid-therapy session (panel 
2) while in the post-therapy session (panel 3) her coronal/dorsal differen- 
tiation (6.12mm) actually remained abnormally small. Presumably with 
practice, as is consistent with the motor learning literature, over time her 
articulations become more phonetically accurate, until the point where 
/t/ and /k/ are perceived by a listener as occupying different perceptual 
categories. 
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Figure 2: Average /t/ and /k/ from 30 typical children at mid-closure. The 
diagonal spokes are some of the radial fanlines (emanating from the probe’s 
virtual centre) used for measurement. For each individual child the maximum 
distance /k/-/t/ along some fanline (in this case, the 4th diagonal line from the left) 


within the anterior and posterior crossing points of the splines for each individual 
child is taken as the degree of coronal-dorsal differentiation. 


5. Conclusion 


Since the 1980s instrumental phonetic techniques have increasingly been 
applied as biofeedback for learning new articulations in children who 
have failed to acquire particular phones through the normal route. While 
EPG has dominated the literature as the technique of choice, and has 
been shown to be successful for a large number of children, recent studies 
have focussed on ultrasound visual biofeedback. For the most part VBF 
is described as a motor-learning approach, though it is often used with 
children who present with errors described as “phonologically delayed”. 
The case study above shows that even in these cases, evidence of subtle 
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TD Mean=12, SD=3 
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Figure 4: Max radial difference of /k/-/t/ for 07F over time. Y-axis, radial difference 
between /k/ and /t/, x-axis intervention time point. Grey dashed box: expected 
radial difference between /k/ and /t/ for typically developing children. 


motor-impairments can exist. This calls into question the underlying 
impairment these children have. However, we wish to caution the reader 
from drawing the conclusion that all children with “phonological delay” 
in fact have motor-based problems. Evidence from a large study by Wren 
et al. (2016) shows that early signs of subtle motor impairment such as 
weak sucking at six weeks of age, predicts persistent SSDs, and not SSDs 
which remediate in the preschool years. It therefore seems plausible that 
children with persistent disorders, as exemplified here, are a different sub- 
group from the outset. 

The agents of change in VBF remain underexplored. There are at 
least four different potential “active ingredients” in VBF therapy that do 
not exist in traditional approaches: 1. Improved diagnostic information 
provided by articulatory analysis prior to intervention; 2. An accurate 
visual articulatory model provided by target patterns/tongue movements; 
3. Increased accuracy of positive feedback from the treating SLT made 
possible by viewing movements; 4. Biofeedback. In reality a combination 
of all these factors likely impacts on the ability of children to achieve, 
practice, and ultimately generalise new articulations following biofeed- 
back interventions. 
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Marion Dohen 


Do manual gestures help the learning of new 
words? A review of experimental studies 


Abstract: We all produce manual gestures when we speak and these gestures have 
been shown to play an important role in the act of communicating. The aim of 
this chapter is to further investigate the specific role played by manual gestures 
in combined semantic and lexical learning by reviewing the experimental evi- 
dence provided by the literature. Nineteen articles met our selection criteria. They 
explore the effect of manual gestures in learning new words in both typically 
developing and speech and language impaired participants. Even though it was 
not an exclusion criterion, none of the studies dealt with adults: all tested children 
of various ages. Several research questions are addressed: 1. Is there a general 
advantage of using manual gestures in learning new words? 2. Is there a specific 
effect of manual gestures vs. other additional cues? 3. Is there a differential effect 
on learning to comprehend and to produce the newly learned words? 4. Do dif- 
ferent types of gesture have different effects? 5. Does testing at different points in 
time yield different results? 6. Does producing the gesture during training matter? 
7. Do manual gestures help generalize the use of newly learned words to new 
contexts? Hypotheses on the reasons why gestures would play a positive role for 
word learning are then suggested. 


Keywords: gestures, memory, novel word learning, child language acquisition, 
speech production, speech and language disabilities 


1. Introduction 


Manual gestures are part of communication. We all move our hands and 
arms while we speak and researchers have argued that these gestures 
“are an integral component of the communicative act of the speaker” 
(Kendon, 2004; p. 359). According to the growth point theory, gestures 
and speech stem from a common thought process (McNeill, 1992; 
McNeill & Duncan, 2000; McNeill et al., 2008). They would even be con- 
trolled by the same motor system (Gentilucci & Dalla Volta, 2008). The 
brain integrates both signals when perceiving a communicative act (e.g., 
Ozyiirek et al., 2007) even though the networks involving the processing 
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of the two modalities do not overlap completely (Bernardis, Salillas, & 
Caramelli, 2008). 

Before being able to speak, babies begin communicating intentionally 
using gestures, more specifically pointing gestures. This is probably due 
to the fact that manual gestures are mastered more easily by infants than 
speech (e.g., Goodwyn & Acredolo, 1993). Gestures are indeed holistic 
whereas speech is sequential. From a motor point of view, the hands are 
easier to control than the oral/vocal system required for speaking. One 
could then argue that manual gestures could be used by infants simply 
before they can speak and not actively play a role in speech and language 
development. Iverson and Thelen (1999) proposed a model describing the 
co-development and entrainment of the arm/hand and oral/vocal motor 
systems from birth to around 18 months. The model describes how the 
two systems and their development are closely related. The first gestural 
communicative acts have also been shown to predict the onset of the first 
words (e.g., Iverson & Goldin-Meadow, 2005; Goldin-Meadow, 2007). 
Gesture use is predictive of later vocabulary size (Rowe, Ozcaliskan, & 
Goldin-Meadow, 2008). The type of gesture even predicts the class of 
words acquired (Kraljević, Cepanec, & Šimleša, 2014). Later on, babies 
start combining gestures and words to create utterances and this stage 
has also been shown to be predictive of the first multi-word utterances 
(e.g. Capirci et al., 1996; Goldin-Meadow & Butcher, 2003; Iverson & 
Goldin-Meadow, 2005; Rowe & Goldin-Meadow, 2009). Iverson, Capirci 
and Caselli (1994) showed that, as late as 16 months, the majority of the 
children they observed still had a clear preference for gestural communi- 
cation, even though they had equivalent gestural and verbal repertoires 
(see also, Caselli et al., 2012). From two to 3;6 years of age, it has been 
shown that the production of iconics and beat gestures was correlated with 
language development (Nicoladis, Mayberry, & Genesee, 1999; Mayberry 
& Nicoladis, 2000) and more specifically with verbal vocabulary develop- 
ment (Acredolo & Goodwyn, 1988). 

Taken together, this research suggests that children naturally pro- 
duce manual gestures to communicate and that these gestures play a 
role in language acquisition, not only before speech onset but also later 
(Özçalışkan & Goldin-Meadow, 2005). Purposely encouraging infants 
to communicate using symbolic manual gestures from 11 months of age 
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has been shown to have positive effects on later language development 
(Goodwyn, Acredolo, & Brown, 2000 but see Johnston, Durieux-Smith, 
& Bloom, 2005 for contradictory evidence). Kahn (1981) also tested this 
in “nonverbal, hearing, retarded” children but found the effect to be highly 
dependent on individuals. 

Some studies also show that children use gestures for speech and 
language comprehension (Morford & Goldin-Meadow, 1992). Parental 
gestures help children map meanings on new words (Clark & Estigarribia, 
2011). 18-month-olds manage to interpret gestures and words indifferently 
as labels for object categories, but 26-month-olds seem to have a prefer- 
ence for words (Namy & Waxman, 1998, see also Suanda et al., 2013). 

All this put together suggests that manual gestures play a role in 
acquiring speech and language (Capirci & Volterra, 2008) even though 
it still remains unclear what this exact role is. The aim of this chapter is 
to better comprehend various elements of this role in the specific field of 
word learning. To learn a new word, one has to map both a meaning and 
a lexical form to a concept, which can be respectively labeled as semantic 
and lexical learning. “Word learning is a complex task that requires (...) 
to create new semantic and lexical representations, then link these new 
representations and integrate them with existing phonological, lexical, 
and semantic representations” (Kapalkova, Poligenska, & Siissova, 2016, 
p. 59). 

Gestures have been shown to support different types of learning 
(e.g., Kelly, Manning, & Rodak, 2008; Goldin-Meadow, 2011) and 
long-term memorization (e.g., Church, Ayman-Nolley, & Mahootian, 
2004). More specifically, there is evidence that manual gestures could 
help lexical learning alone. Gestures have indeed been shown to promote 
the learning of words in a foreign language in children (Tellier, 2008 
but see Rowe, Silverman, & Mullan, 2013) and adults (Macedonia & 
von Kriegstein, 2012; Kelly et al., 2014; Macedonia & Repetto, 2016). 
Rowe, Silverman and Mullan (2013) put forward the fact that this effect 
is dependent on the individual. Gogate, Bahrick and Watson (2000) 
observed that mothers naturally use gestures when they teach new words 
to their infant. Evidence also suggests that manual gestures facilitate lex- 
ical access in adults (e.g., Rauscher, Krauss, & Chen, 1996; Krauss & 
Hadar, 1999), 
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This chapter will focus on semantic and lexical learning combined, 
such as in the situation in which a child learns a new word from her na- 
tive language. It will provide a review of the experimental evidence on 
the role of manual gestures in word learning. The first question it will 
address is whether or not the existing evidence suggests a positive role 
of adding manual gestures to spoken words for learning them in both 
typical individuals (section 3.2.) and in people with speech and language 
impairments of various types (section 3.3.). We will then (second ques- 
tion) analyze whether using manual gestures has a differential effect com- 
pared to using other additional cues (section 3.4.). The third question will 
examine whether there are differences between receptive and expressive 
learning in terms of types of effects of adding gestures (section 3.5.). We 
will then question whether the type of gesture has an influence (fourth 
question, section 3.6.). The fifth question will examine whether the effect 
is immediate and if it holds over time (section 3.7.). We will also examine 
whether producing the gesture vs. simply observing it makes a difference 
(sixth question, section 3.8.). Finally, the seventh question will tackle gen- 
eralization of learning (section 3.9.). A discussion will then suggest several 
hypotheses to explain the potential positive effect of manual gestures on 
word learning. 


2. Methodological considerations 
2.1. Terminology and acronyms 


In the following, expressive and receptive learning will be distinguished. 
Expressive (or productive) learning refers to being able to produce the 
learned word upon testing. Receptive (or comprehensive) learning refers 
to being able to comprehend the learned words upon testing. For the 
sake of space and clarity, the following acronyms will be used in the 
text: CI: Cochlear Implant; T21: Trisomy 21; SLI: Specific Language 
Impairment; TD: Typically Developing. 


2.2. Inclusion criteria 


This analysis reviews only articles written in English in order for the reader 
to be able to directly access their content. It explored only journal arti- 
cles. Only experimental studies directly controlling training and testing 


Do manual gestures help the learning of new words? 165 


material and procedures were included. We decided to exclude observa- 
tional studies, even if they describe valuable data: it is indeed difficult 
to evaluate the size of the effects when the material used cannot be con- 
trolled and varies from one participant to the other. No further restric- 
tion was made on methodological aspects e.g., number of participants, 
age, language tested and type of population (typical and clinical). To be 
included, studies could test receptive and/or expressive learning as well as 
generalization. We also chose to review studies both directly evaluating the 
effect of gesture vs. none, those comparing different types of gestures and 
those comparing the use of gesture vs. other additional cues. 


2.3. Exclusion criteria 


Studies analyzing the role of gestures in learning words in a foreign 
language or learning new pseudo-words for already known words were 
excluded from this review in order to be more homogenous in terms of 
cognitive processes involved in the task performed by the participants. 
They are commented on in the introduction. This analysis excluded con- 
ference articles. 


3. Description and analysis of selected studies 
3.1. General description of the sample of studies reviewed 


The final sample of studies reviewed here consists of 19 articles describing 
a total of 20 experimental studies relevant to the topic. Some articles 
describe two studies whereas some studies are analyzed in two articles 
from different points of views. Even though this was not a selection crite- 
rion, all the studies found dealt with children. Fourteen studies included 
TD children and eight included children with various disabilities (DIS) 
involving speech impairments: children with T21 (three studies, N=21), 
children with SLI/Developmental Language Disorder (three studies, N= 
57), deaf and hearing-impaired children (three studies, N=38) and chil- 
dren with cerebral palsy (one study, N=3). Table 1 provides an over- 
view of the characteristics of the populations involved in the studies in 
terms of number of participants and ages. One can note the strong var- 
iability in the number of participants included in the different studies 
ranging from 4 to 120 as well as in their ages ranging from a mean age 
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Table 1: Overview of the populations analyzed in the 20 experimental studies 
in terms of number of participants and age (TD = Typically Developing chil- 
dren; DIS = children with various disabilities involving speech impairments; 
mos. = months; yrs. = years; age in years: yrs.;mos.). 


Number of Age of participants 


participants 

TD DIS TD DIS 
Total 527 119 
Mean 35.1 14.9 44.1 mos. (3;8 yrs.) 82.3 mos. (6;10 yrs.) 
Standard deviation 29.7 9.2 31.9 mos. (2;8 yrs.) 40.3 mos. (3;4 yrs.) 
Minimum 10 4 8.45 mos. 42.3 mos. (356 yrs.) 
Maximum 120 33 128 mos. (1058 yrs.) 158.4 mos. (13;2 yrs.) 


of 8.45 months to 13;2 years. Eleven studies dealt with English (N=401), 
four with Dutch (N=127), three with German (N=100) and one with 
Slovak (N=18). Table 2 provides details on all these studies such as 
number of participants, age, language, experimental design, and type of 
gestures tested. 


3.2. Is there a general advantage of adding gestural 
cues for learning new words in TD children? 


The aim here is to provide a first very general overview of the results of 
the studies concerning the efficiency of adding manual gestures to learn 
new words. A total of eight studies directly compared word learning 
with and without manual gestures (three between-subject designs and 
five within-subject designs) in TD children. As a whole, they tested 
261 participants. Five studies (Capone & McGregor, 2005; Booth, 
McGregor, & Rohlfing, 2008; McGregor et al., 2009; de Nooijer et al., 
2014; Like & Ritterfeld, 2014) involving a total of 212 children put for- 
ward a significant positive advantage of adding manual gestures during 
training to learn new words either expressively, receptively or both. Two 
studies (Bird et al., 2000; van Berkel-van Hoof et al., 2016) involving a 
total of 29 children found no difference between conditions: new words 
were learned equally well expressively and/or receptively whether they 
were trained alone or alongside a manual gesture. There is no clear 
effect of language or age on the effect of gesture on word learning (see 
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table 2). One study (Ting, Bergeson, & Miyamoto, 2012) put forward 
a disadvantage of adding manual gestures: words were learned less well 
when trained with a manual gesture rather than alone. This specific 
study is however quite different from those cited above. It involved 
much younger children (8.5 months) and this implied using specific 
methods very different from those used in the others. During training, 
some infants were familiarized with the target words using videos in 
which they could see a person uttering the words while others saw the 
person speaking and gesturing the words. Upon testing, the infants saw 
videos with a speaker uttering passages including the familiarized words 
vs. ones with the same speaker uttering passages with words not used 
during training. Preference was evaluated through looking durations. 
The infants trained with manual gestures showed no preference for the 
videos with the familiarized words whereas those in the word only con- 
dition did. Even if it was important to include this study in the pre- 
sent review for it to be exhaustive, because of the reasons presented 
above, it was decided to put aside this study when tackling the following 
research questions. One could indeed argue that differences in the effects 
observed could directly result from the great methodological differences 
corollary to involving infants. 

As a whole, the studies reviewed suggest that adding a manual ges- 
ture to the word during training improves word learning performances. 
A potential explanation why some studies found no effect of adding a ges- 
ture to learn new words could be that the participants were children and 
that the gestures used during training were produced by adults. It may be 
the case that, as suggested by de Nooijer and colleagues (2014), gestures 
produced by peer-models would be more efficient. Imitating other’s actions 
may indeed be easier when the actions are modeled by peers of similar 
ages (see Schunk, 1987, mixed results however). It may also be the case 
that children have more facilities identifying themselves with the person 
modeling the action if the latter is a peer (Liuzza, Setti, & Borghi, 2012). 
On the other hand, this explanation is contrary to the fact that language 
acquisition is of course guided by interaction with adults (primarily 
the parents). Note however that as soon as children attend day care or 
school, language acquisition is also largely influenced by communicative 
interactions with peers. 
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Table 2: Summary of the information on the articles reviewed: reference (for the sake 
of conciseness and when there was no ambiguity, all references with more than two 
authors are stated as 1st author et al., year), description of the population, mean age 
of participants (standard deviation, range), experimental design, gesture type tested, 
modality during training (observation and/or imitation), type and number of words 
learned, existing or invented, known or not to the participants, modality during training 
(observation and/or imitation), number of training sessions and frequency, modality 
of recall (expressive and/or receptive learning), testing time: immediate and/or delayed 
(delay after end of training), control of gesture production during recall, summary of 
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(1) - WD+POI 


Exist. or inv. 


5 
s 


inv. 


inv. 


inv. 
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F 
z: 4 
El F 9 
$ È op 
z ` g a 
t PA a= = 
£ 2 3 3 
~ > bb & az 3 
SE E ¢ . a é 
S 3 S É & a > v 
en £ ne E op fa 2 
£ p 6 c = g E & 
6 5 S g EA 3 E £ 
2 F Z ń É & B 4 
no obs. 1 expr + imm no expr.: GAZE~GAZE+POI~ English 
rec.+ + deld. GAZE+POI+T~ 
rec. cat. (3-5 d.) GAZE+POI+T+M~BSL - 
general. imm. > deld. 
rec. & rec. general.: 
GAZE+POI, GAZE+POI+T 
GAZE+POI+T+M > BSL 
no obs. 3/? expr + imm. no expr.: TD > T21 English 
+ free rec. - WD~ARB~WD+ARB 
imit. rec.: T21: WD+ARB>WD 
~ARB - TD: none 
no obs. 3/daily expr.+ imm.+ yes expr.: uncued resp.: English 
rec. deld. (~ (Capone, WD+SHP>WD~WD+FNC - 
9.5d.) 2007) cued resp.: WD+SHP~ 
WD+FNC>WD 
rec.: WD+SHP>WD+FNC~ 
WD-~chance 
no obs. 3/~every expr. deld. (~ no All tests: English 
2d. +cat. 41d.) WD+SHP>WD+POI~ 
general. WD+FNC 
(expr. + 


rec.) 
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Table 2: Continued 


$ aE 2 
= F me E = 
i pas ae 
Ge 5 z=] 5 V 
: E $o È 
| E£ z2 a 5 
I ee aa 5 v . 
= = HE- e E Š 
g © S oa fz] oS = 
g E È Eg 2 z 5 
pa =. © 35 2 © = 
$ he $ F 3 5 f 
ia aS < Sp o) > a 
de Nooijer et al., TD - 53 (31 f.) 8.6 yrs. (0.6) w/in.: DEF - obs. and/ v.f (24) exist. 
2014 DEF+PANT"(6) or imit. 
- DEF+PANT+IMIT 
(6) - DEF+ACT 
Giezen et al., CI - 8 (2 f.) 6;11 yrs. w/in.: WD - ARB obs. n. (8) inv. 
2013 (study 2) prelingually (9 mos. (8) - SSS 
deaf - 5;9-8;1) 
Kapalková et al, TD - 18 (12 f.) 2 yrs.(24-34 btw.: WD+ICO# (10) obs. + ? (10) in. 
2016 mos.) - WD+PIC imit. 
Kohl et al., 1979 H -4 (? f.)3 CP, 13.2 yrs. w/in.: WD - PartSgn obs. n. (18) exist. 
17T21 (11.1-16.1) - CompSgn + v. (6) 
+ prep. 
(6) 
Lüke & TD -20 (5 f.)  4;9 yrs. w/in.: WD+ICO (3) - obs. n. (9) inv. 
Ritterfeld, 2014 (3;4-5;11)  WD+ARB (3) - WD 
(study 1) 
Lüke & SLI - 20 (7 f.) 457 yrs. btw.: WD+ICO obs. n. (9) inv. 
Ritterfeld, 2014 WD+ICO:10 (3;4-5;7) (9) - WD 
(study 2) (5 f., 4 bil.) 
WD: 10 (2 f., 
5 bil.) 
McGregor et al., TDi -40 (21 f.) 1;8-2;0 yrs. btw.: WD - obs. Under exist. 
2009 WD: 13 (7f) WD: 20.68  WD+IC© (1) - 


WD+G: 12 mos. (0.95) WD+PHO 
(8 f£.)WD+P: 15 WD+G: 21 
(5 f.) mos. (1.54) 

WD+P: 21.26 

mos. (1.38) 


Words known? 


B 
© 


no 


no 


no 


no 


no 


no 
except 
4 

part. 


Word modality for training 


° 
ion 
n 


obs. 


obs. 


obs. 


obs. 
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No. of training ses./Frequency 


2 / daily 


15/4 
times a 


wk. 


15 / daily 
(Mond. 
- Sat.) 


3 / weekly 


Recall modality 


Q 
oO 
ph 


recall + 
rec. 


rec. 


expr. 


expr. + 


expr. + 


rec. + 
general. 


Testing time 


5 
=| 


imm. 


deld. 
(T1:1 
d.; T2: 2 
wk.; 

T3: 6 
wk.) 


imm. 


imm. 


imm. / 
deld. (1 
wk.) 


imm. 
/ deld. 
(2-3 d.) 


Gesture prod. during recall 


5 
© 


no 


yes 


no 


no 


Summary of results 


def. recall: DEF+PANT>DEE, 
DEF+ACT for loc. v. (vs. abs. 
& obj.) 

rec.: none 

WD~ARB~SSS 


WD+ICO>WD+PIC 
T1>T3 - T1~T2 - T2~T3 


expr.: for 1 part. only 
PartSgn, CompSgn > WD 
rec.: CompSgn~PartSgn>WD 


expr.: none, no correct 
labelings 

rec.: WD+ICO~WD+ARB> 
WD 


imm.: expr. & rec.: none 
deld.: T1 & t2: expr.: 
WD+ICO>WD rec.: none 


rec.: 
WD+ICO~WD>WD+PHO 
- imm.~deld.>pre-test 

rec. general.: 
WD+ICO~WD>WD+PHO 
- deld.>imm.~pre-test 
WD+ICO: deld.>pre-test, 
deld.~imm., imm.~pre-test 
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Language 


Dutch 


Dutch 


Slovak 


English 


German 


German 


English 
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Table 2: Continued 


Reference 


Mollink et al., 
2008 


Mumford & 
Kita, 2014 


O'Neill, 
Topolovec, 

& Stern- 
Cavalcante, 2002 
(expe. 1) 

O'Neill, 
Topolovec, 

& Stern- 
Cavalcante, 2002 
(expe. 2) 

Romski & Ruder, 
1984 


Ting et al., 2012 


) 


Population — no. of part. (no. 


a 
Ge 
© 


HI - 14 (10 f.) 
All wore 
hearing aids 


TD - 120 (57 f.) 
WD+MG: 36 
WD+ESG: 32 
WD: 33 

TD - 40 (22 f.) 
DES: 20 (12 f.) 
POI: 20 (10 f.) 


TD - 32 (16 f.) 
DES: 16 (8 f.) 
POI: 16 (8 f.) 


T21-10(?f.) 


TD - 20 (? f.) 
WD+SGN: 10 
WD: 10 


Marion Dohen 


Age of part.: m. (sd. - rng.) 


5311 yrs.(13 
mos. - 4;4-8;3) 


41.48 mos. 
(3.13 — 36-47) 


DES: 33.9 
mos.(0.98, 
32-35) 

POI: 34.8 mos. 
(1.11, 33-36) 
DES: 39.8 
mos.(1.76, 
37-43) 

POI: 40.6 mos. 
(1.75, 37-43) 
5:7 yrs. 

(14.94 mos., 
3:11-7:10) 


Experimental design (no. of 
gestures presented if relevant) 


w/in. 


: CTL- WD 


- WD+SGN' (16) 
- WD+CLR™ 


btw.: WD - WD+MG 


(5) - W+ESG (5) 


btw.: 


WD+ICO" 


(5) - WD+POI 


btw.: 


WD+ICO° 


(4) - WD+POI 


w/in.: CTL’ - WD - 


WD+SGN‘ (4) 


WD+SGN: 8.5 btw.: WD 
mos. (1, 7-9.5) - WD+SGN" (4) 


WD: 8.4 mos. 
(0.93, 7.1-9.5) 


ob 

& 

=! > 

5 g 

= a 

£ a 

& A 

T 3 

ae} v . 

È E 

2 z 5 

l z á 

obs. n. (64) exist 

obs. v. (5) inv. 

obs. adj. (5) exist. 
adj.? (4) 

obs. + n. (12) + exist. 

enactment v.‘ (12) 

of actions 

on obj. 

obs. n. (4) exist. 
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g 
gE 3 F 
2 § 3 
z é se 
a £ 2 
£ A 5 3 
2 3 E 3 £ 3 z v 
Br = E fz] = a op 
CA E Ge on g g Q 
3 z] © i R- E £ 5 
p p : oS 
5 8 ő 3 3 3 E 5 
B > Z cA H oO D 4 
no obs. + 3/weekly expr. deld. no condition: WD+SGN> Dutch 
imit. (T1:1 WD+CLR~-WD>CTL 
wk; test time: T1>T2 
T2:5 iconicity: T1: strong~weak - 
wk.) T2: strong>weak 
no obs. 1 rec. imm. no WD+MG>WD~WD+ESG English 
general. 
var. obs. 1 rec. imm. yes Tendency towards English 
general. WD+ICO>WD+POI 
no WD+ICO>WD+POI 
not obs. m.=23.1 expr. deld.(?) yes expr.: few resp. English 
(rng.=10- + rec. - WD~WD+SGN 
48)/daily + rec. rec.: not sig. but ad. for 5 
(weekdays) general. part. dis. for 2 part. none for 
3 part. 
expr. general.: 
WD>WD+SGN 
rec. general.: WD+SGN>WD 
- rec.>expr. 
no obs. 1 rec. imm. no looking time: WD+SGN: English 
trained~untrained 


WD: trained>untrained 


(continued on next page) 
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Table 2: Continued 


g = sF a 
R 4 $8 E 7 
: ac z È 
Te m gH © v 
: = E $o È 
| E£ z2 a 5 
l 5 EE E £ 
g E 5 p E $ 
5 E $a Eg g P 
w 3a © 5 8 5 5 
$ Ba $ àg g 5 
a as < bp io) > 
van Berkel-van 2 (25 £.)HI" 10;8 yrs. (8.47 w/in.: WD - obs. + n. (20) 
Hoof et al., 2016 - “1 (2 f.) SLI“ - mos., 9-11) WD+ICO* (10) imit. 
a £)TD -19 
( f.) 
Vogt & SLI - 20 (10 f.) SLI: 4;6 w/in.: WD+ICO (var.) obs. n. 
Kauschke, 2017a TD AM - 20 yrs. (037) - WD+ATT’ (1) (var.)? + 
(10f.)TDLM TD AM: 4;5 v. (var.)™* 


-20(11f.)No yrs. (0;3)TD 
exposition to LM: 333 yrs. 
gesture or sign (0316) 


Vogt & 
Kauschke, 2017b 


Abbreviations (in alphabetical order): ad.: advantage — adj.: adjective — beg.: beginning 

— bil.: bilingual — btw.: between-subject design — cat.: category — d.: day — def.: definition — 
deld.: delayed — dis.: disadvantage — exist.: existing — expe.: experiment — expr.: expression 
- f.: female — general.: generalization — imm.: immediate — imit.: imitation — inv.: invented 

- m.: mean — mo.: month — mos.: months — n.: noun — no.: number — obj.: object — obs.: 
observation — part.: participant — prep.: preposition — prod.: production — rec.: reception — 
rng.: range — sd.: standard deviation — ses.: session — sig.: significant — v.: verb — var.: variable 
— w/in.: within-subject design — wk.: week — yr.: year — yrs.: years. 

Acronyms (in alphabetical order): AM: Age-matched group (individually matched in 
chronological age (+/- 9 mos.) and gender) — CI: Cochlear Implant (prelingually deaf) — 

CP: Cerebral Palsy — HI: Hearing Impaired - T21: Trisomy 21 - H: Handicapped — LM: 
Language-matched group (individually matched on grammar comprehension, receptive and 
expressive vocabularies (nouns and verbs), word definition and nonword repetition (scores 
+/- 1/2 sd)) — SLI: Specific Language Impairment — SLN: Sign language of the Netherlands — 
TD: Typically Developing. 


Exist. or inv. 


me 
B 


V. 


Words known? 


=] 
jo} 


no 


Word modality for training 


obs. 
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No. of training ses./Frequency 


3/ w/in. 1 


3 / every 
2-3 d. 


Recall modality 


= 
La] 


C. 


expr. + 
rec. 


def. 
prod. 


Testing time 


(T1) and 
3 (T2) 

+ 1 ses. 
within 


same wk. 


(T3) 
imm. 
after 1st 
training 
ses. (T1)/ 
deld. 
(T2, 2-3 
d.) 


deld. 
(2-3 d.) 


Gesture prod. during recall 


5 
[e] 


no 


E 
g 
Ma 
[=] 
my So 
: : 
: 
n = 
HI: WD+ICO>WD - Dutch 


T1<T2<T3 - ad. gets larger 
over time 

SLI: WD+ICO~WD 

- T1<T2<T3 

TD: same as SLI 


expr.: WD+ICO>WD+ATT German 
- T2>T1>pre-test 

T1: v.: WD+ICO>WD+ATT, 

n.: WD+ICO~WD+ATT 

T2: v.: WD+ICO~WD+ATT, 

n.: WD+ICO>WD+ATT 

rec.: WD+ICO>WD+ATT 
-T2>T1>pre-test 


General tendency: 
WD+ICO>WD+ATT 
- LM<SLI~AM 


(continued on next page) 
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Table 2: Continued 


Experimental conditions: ACT: action enactment (part. asked to create a ges.) - ARB: 
arbitrary sign (no iconic resemblance with referent) - ATT: attention-directing gesture 

— BSL: baseline - CompSgn: complete signing (all words are signed), signs from Signed 
English Dictionary, 1 sign for each word — CTL: control (no training: only pre-test and 
post-test) — DEF: verbal definition — ESG: end-state gesture (depicts shape or lines formed 
by action) — FNC: function gesture (dynamic symbol) — ICO: iconic gesture — IMIT: 
gesture imitation - MG: manner gesture (depicts action of the hand) — PANT: pantomime 
— PartSgn: partial signing (signing of keywords only) - PHO: photo — PIC: picture — POI: 
pointing at target - SGN: sign —- SHP: shape gesture (static symbol) — SSS: Sign Supported 
Speech - WD: word 

a GAZE: Experimenter gazes at target — T: Experimenter additionally extends arm till touches 
object — M: Experimenter additionally pushes object across the table — BSL: Experimenter looks 
at table midway from target and foil 

> Group matching: mental age - T21 trained in manual signs - TD no 

€ Signs produced with both hands symmetrically 

4 Subgroup of Capone & McGregor (2005) 

€ Gesture observation only 

f Verbs of 3 types: locomotion (loc.), object-manipulation (obj.), abstract (abs.) 

8 Gestures based on Slovak Sign Language 

^ Tconics constructed or adopted from German Sign Language (visible feature of characters’ 
head or neck) 

i Group matching: age, sex, bilingualism 

i Group matching: chronological age, total number of words, number of spatial terms 

k Experimenter holds right hand over left and moves right hand under the left 

' Signs from the Sign Language of the Netherlands — 2 sub-conditions: strong and weak 
iconicity 

™ Experimenter names a color in addition to naming the picture 

" Demonstrating property described by adjective, performed on toy 

° Demonstrating property described by adjective, performed on toy 

P Adjectives describing tactile properties applied to animal names familiar to the 
participants 

a Control for natural acquisition of the trained words (only pretest and postest) 

t Signs from Signing Exact English (Gustason, Pfetzing, & Zawolkow, 1975) 

` Transitive verb, representing object manipulation 

t Individualized selection for each participant to select unknown words 

" Signs from Signing Exact English (Gustason, Pfetzing, & Zawolkow, 1975) — 2 signs out 
of 4: iconic; 2 other arbitrary 

“ Hearing aids of different types — exposed or users of the SLN 

w Some exposed to SSS 

x Gestures invented accordingly to SLN formational principles and depicting “a defining 
feature” of the character 

” Attention-directing gesture: raised forefinger in front of upper body 

* Shape of animal — no. of nouns taught: TD AM: 6; SLI & TD LM: 4 

a Manner and/or path of movement — no. of verbs taught: TD AM: 6; SLI & TD LM: 4 
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3.3. Is there a general advantage of adding gestural 
cues for learning new words in children with 
speech and language deficits of various types? 


A total of seven studies directly compared word learning with and 
without manual gestures (one between-subject design and six within- 
subject designs) in children with speech and language difficulties. They 
tested a total of 99 children. Giezen, Baker and Escudero (2013; N=8) 
found no effect of adding a manual gesture for word learning in children 
with CI. Van Berkel-van Hoof and colleagues (2016; N=17) found the 
same result for children with SLI. All other studies found an advantage 
of using manual gestures for learning new words, whether it be expres- 
sively or receptively or both, in a total of 74 participants (children with 
T21: Kohl, Karlan, & Heal, 1979; Bird et al., 2000; Romski & Ruder, 
1984 — children with SLI: Like & Ritterfeld, 2014 — children with hearing 
impairments: Mollink, Hermans, & Knoors, 2008; van Berkel-van Hoof 
et al., 2016 — children with cerebral palsy: Kohl, Karlan, & Heal, 1979). 
Among the latter studies, Bird and colleagues (2000) and van Berkel-van 
Hoof and colleagues (2016) also included a group of TD children for 
which they found no effect of adding manual gestures to the learning of 
new words. The lack of a positive effect in Giezen, Baker and Escudero 
(2013; children with CI) could be explained by the fact that they used 
only one training session and only immediate and no delayed testing. The 
number of taught words was also very important (64) which could result 
in a floor effect. The authors interestingly put forward that, even if there 
was no positive effect, there was no negative effect either. Using manual 
gestures thus did not interfere with word learning. 

The results of the reviewed studies therefore suggest that manual 
gestures could help children with speech and language difficulties learn 
new words, maybe even more so than for TD children. Van Berkel-van 
Hoof and colleagues (2016) actually found a positive effect of gestures 
for word learning only for hearing impaired children (vs. TD children and 
SLI children). They hypothesize that “Because these children are bimodal 
bilinguals, they process augmentative signs through the phonological loop 
as they do speech” (p. 346). Gestures may be more effective to help learn 
new words when the participants are used to using and/or seeing gestures. 
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Bird and colleagues (2000) indeed found a positive effect of signs for word 
learning in children with T21 familiar with signing but not in TD children. 
Van Berkel-van Hoof and colleagues (2016) also found a positive effect of 
iconic gestures for children with hearing impairments familiar with sign 
language whereas they found no such effect in children with SLI and TD 
children. Note however that this hypothesis is not backed by the fact that 
a number of other studies did find positive effects of gestures for word 
learning in TD children who had never been exposed to signs. There may 
be interferences with other factors such as age and duration of training, 
but these are difficult to analyze because of the variability in methodolo- 
gies used and populations tested. A crucial point is that none of the studies 
found a negative effect of adding manual gestures for word learning: either 
there was no effect or a positive one. 


3.4. Are manual gestures more (or less) efficient than 
other additional cues for word learning? 


It could be the case that providing any additional cue, whether it be a 
manual gesture or something else, could improve word learning. This 
section examines in more detail the studies comparing the effect of using 
manual gestures to that of using other additional cues. Booth, McGregor 
and Rohlfing (2008) compared several conditions: 1. using pointing to the 
object to learn its label; 2. additionally touching it; 3. additionally moving 
it across the table (in TD children). There was no advantage of the two 
latter conditions compared to the former which all yielded similar positive 
effects for receptive word learning compared to a word alone condition. 
This suggests that the advantage solely emerged from using a pointing ges- 
ture since adding other cues did not further improve the effect. 

Kapalkova, Poligenska and Siissova (2016) analyzed expressive word 
learning in two groups of TD participants: one of them learned the words 
alongside manual gestures and the other with pictures. Even though 
participants managed to learn the new words in both groups, performances 
were significantly better in the gesture group. 

McGregor and colleagues (2009) compared the learning of the prep- 
osition ‘under’ in three groups of TD children: one with the word only, 
one with an additional manual gesture, and one with a photograph. The 
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results show that performance improved from pre-test to immediate testing 
after training to a similar extent in all groups. A further analysis, however, 
showed a manual gesture advantage when comparing performance at pre- 
test and delayed post-test. This suggests that manual gestures promote 
learning more than photographs, not immediately after training, but after 
a two- to three-day delay. Manual gestures would thus be more efficient 
for maintaining word learning over time. 

Mollink, Hermans and Knoors (2008) compared adding a sign to the 
spoken word during training to adding a color (labeled verbally by the 
experimenter during learning) to providing the word alone in children with 
hearing impairments. They found that receptive learning performances 
were better for the word + sign condition than for the two other conditions 
and not different for the two other conditions. This suggests that adding a 
color does not help promote word learning more than learning the word 
alone whereas adding a gesture does. 

The results of the above four studies suggest that adding a manual ges- 
ture to the word during training is not equivalent to adding any other cue. 
It appears that the gesture plays a different role than other additional cues 
such as pictures. 


3.5. Is there a differential effect of gesture on 
expressive vs. receptive learning? Interaction 
with number of training sessions 


As stated in section 2, two types of learning can be distinguished: expres- 
sive and receptive learning. The aim of this section is to differentially 
examine the effect of adding manual gestures to the learning of new words 
for expressive and receptive learning. 

Out of the seven studies cited to address the question in section 3.2. and 
directly comparing the use of gesture vs. none for word learning in TD 
children, four tested receptive learning only (McGregor et al., 2009; Ting, 
Bergeson, & Miyamoto, 2012; de Nooijer et al., 2014; van Berkel-van 
Hoof et al., 2016) and four evaluated receptive and expressive learning 
(Bird et al., 2000; Capone & McGregor, 2005; Booth, McGregor, & 
Rohlfing, 2008; Like & Ritterfeld, 2014). 

Concerning receptive learning, four studies (Capone & McGregor, 
2005; Booth, McGregor, & Rohlfing, 2008; McGregor et al., 2009; Like 
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& Ritterfeld, 2014) put forward an advantage of using gestures (vs. none) 
to learn new words, whereas the three others (Bird et al., 2000; de Nooijer 
et al., 2014; van Berkel-van Hoof et al., 2016) found no effect. The number 
of training sessions does not seem to explain the fact that some studies 
found no effect: two of the studies finding no effect used three training ses- 
sions and one of them used two, whereas three studies finding a positive 
effect of using manual gestures used only one training session and another 
study only three sessions. Another hypothesis to explain differences in the 
results could be the number of words learned. de Nooijer and colleagues 
(2014) taught 24 words to the participants and van-Berkel-van Hoof and 
colleagues (2016) 20 and found no effect, whereas all other studies finding 
a positive effect of adding manual gestures to learn new words taught 
between one and nine words. 

Recall that all the studies testing expressive learning also analyzed recep- 
tive learning. Only one study (Bird et al., 2000) found no effect of gestures 
on both expressive and receptive learning. Booth, McGregor and Rohlfing 
(2008) as well as Liike and Ritterfeld (2014) found that whichever condi- 
tion (gesture or none), the participants did not manage to learn the new 
words expressively, even though they did receptively with an advantage 
for the gesture condition. As stated by Booth, McGregor and Rohlfing 
(2008), this may be due to insufficient training (only one session in both 
studies). Expressive learning would thus require more training than recep- 
tive learning. This hypothesis is corroborated by the results of Capone 
and McGregor (2005) who found a positive effect of iconic gestures 
underlying shape (vs. iconic gestures depicting function and no gesture) 
for learning new words after three training sessions. Note that when the 
participants did not manage to provide an expressive response, the exper- 
imenter provided a gestural cue. For these cued responses, the authors 
found a positive effect of both types of iconic gestures over no gesture. 
In the no gesture condition, the participants did not manage to provide 
any expressive responses even though they managed to learn some words 
receptively just as in Booth, McGregor, & Rohlfing (2008) and Like & 
Ritterfeld (2014). Capone Singleton (2012) obtained similar results com- 
paring shape gestures, function gestures and pointing gestures: shape 
gestures showed an advantage over function and pointing gestures for 
expressive word learning after three training sessions. Overall, these 
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observations corroborate the fact that receptive learning is faster than 
expressive learning and that adding gestures can promote faster expres- 
sive learning even though such learning still takes longer than receptive 
learning. Note however that Vogt and Kauscke (2017a) found an advan- 
tage of iconic gestures over attention-directing gestures even at the end 
of the first training session for expressive learning. Booth, McGregor and 
Rohlfing (2008) also suggest that the lack of a positive effect of gestures 
on expressive learning may be due to the fact that the participants were 
not asked to produce the words during training (also the case in Like & 
Ritterfeld, 2014 and Capone & McGregor, 2005). All these arguments are 
further corroborated by the study by Kapalkova, Polišenská and Siissova 
(2016) who found positive effects of using iconic manual gestures (vs. pic- 
ture support) on expressive word learning after 15 sessions in a paradigm 
in which participants produced the words during training. 

Out of the seven studies addressing the effect of adding a manual ges- 
ture on learning new words in children with speech and language diffi- 
culties (section 3.3.) and directly comparing the use of gesture vs. none, 
two tested receptive learning only, one expressive learning only and four 
both receptive and expressive learning. All studies, except Giezen, Baker, 
& Escudero (2013) and Liike & Ritterfeld (2014), found a positive effect 
of adding manual gestures to the learning of new words receptively for at 
least one group of children with speech and language impairments. Note 
however that the effect was not significant in Romski and Ruder (1984): it 
was positive for only five out of 10 children with T21. Van Berkell-van 
Hoof and colleagues (2016) found a gestural advantage for receptive 
learning for children with hearing impairments but not for those with SLI. 

Bird and colleagues (2000) found no effect of adding manual gestures to 
the learning of new words expressively in children with T21 even though 
they did for receptive learning. Liike and Ritterfeld (2014) and Mollink, 
Hermans and Knoors (2008), on the other hand, report a positive effect 
respectively in children with SLI and hearing impairments. The effect is 
also positive in Kohl, Karlan, and Heal (1979) for children with T21 or 
cerebral palsy but it does not reach significance. Results from Romski and 
Ruder (1984) are unclear. It appears that a positive effect for expressive 
learning was obtained more often in children with disabilities compared 
to TD children (see above). To build on the discussion above, it is also the 


182 Marion Dohen 


case that the studies dealing with children with disabilities often included 
more training sessions than those dealing with TD children. 

In a nutshell, the effect of manual gesture on receptive learning appears 
to be influenced by the number of words taught: when too many words are 
taught no advantage for gestures appears. Concerning expressive learning, 
some studies find no effect of gesture, but this is mainly due to the fact that 
the children did not manage to learn the words expressively whichever 
the condition mainly because of insufficient training (floor effect). With 
more training, children managed to learn new words expressively with an 
advantage when manual gestures were present during training (though see 
Vogt & Kauscke, 2017a, for an advantage of gesture on expressive word 
learning after only one training session). It seems that the positive effect 
of manual gestures on expressive word learning is greater for children 
with speech and language impairments though the studies involving such 
participants generally included more training sessions than those with TD 
children. 


3.6. Does gesture type matter? Does iconicity matter? 


The aim of this section is to analyze whether the experimental evidence 
puts forward a specific advantage of adding different types of manual 
gestures to the learning of new words. Specifically, one could hypothesize 
that if the gesture puts forward an iconic resemblance with the referent 
it could be more beneficial for learning the new word labeling the ref- 
erent: “if a sign is more iconic than the spoken word, its form conveys 
information about a word’s meaning and may thus assist a child in map- 
ping new words to meanings” (Bird et al., 2000, p. 260). 

Like and Ritterfeld (2014) found that both iconic and arbitrary gestures 
were equally beneficial to word learning in TD children aged 4;9 years. 
This could appear as contrary to the results of Namy and Waxman (1998) 
who compared the ability of children from 18 to 26 months to learn either 
word or arbitrary gestural labels. They found that 18-month-olds learned 
word or gestural labels indifferently whereas 26-month-olds learned word 
labels more easily and needed extra training to learn gestural labels. Bird 
and colleagues (2000) also found that 21.8-month-olds were not able to 
expressively learn arbitrary gestural signs alone as labels. Marentette and 
Nicoladis (2011) found that children aged 40 to 60 months could learn 
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iconic gesture labels for objects but not arbitrary gesture labels. This dif- 
ference in findings could suggest that, even if children have difficulties 
learning only an arbitrary gestural label (without a word) for an object, 
the arbitrary gesture could however help them learn a corresponding and 
simultaneously presented word (at least receptively). Like and Ritterfeld 
(2014) hypothesize that “arbitrary gestures may have enhanced the 
interest of the child in the presented words in contrast to words intro- 
duced without gestures”. This finding is however contradictory to that of 
Bird and colleagues (2000) who found no difference between a word only 
and a word + arbitrary sign condition on expressive and receptive word 
learning (mean age = 21.8 months). This difference in findings could be 
due to the age of the participants: the ones in the Bird et al. (2000) study 
may have been too young to manage to learn arbitrary signs even though 
Namy and Waxman (1998) found that 18-month-olds can learn words 
and arbitrary gestural labels. This hypothesis is backed by the fact that 
Bird and colleagues (2000) did find a beneficial effect of arbitrary signs 
for receptive learning of new words in older children with T21 (mean 
age = 42.3 months). Note however that the participants with T21 were 
trained in using signs prior to the study whereas TD children were not. 
Giezen, Baker and Escudero (2013) found no benefit in adding arbitrary 
signs to learn new words in 6;11-year-old children with CI. The scores 
were however close to 100 % and the absence of a gesture benefit could 
be due to a ceiling effect. 

Another surprising observation, taking into account the results of 
Marentette and Nicoladis (2011), is that Like and Ritterfeld (2014) found 
no advantage of iconic gestures over arbitrary ones. This could be due to 
a ceiling effect for iconic gestures (as suggested by the authors themselves). 
Vogt and Kauschke (2017a) ran an interesting follow-up experiment to 
their main study comparing iconic and attention-directing gestures. It 
compared the use of iconic gestures versus arbitrary ones. Even though 
sample size was small (18 TD children) and impeded reaching statistical 
significance (according to the authors themselves), the results suggest that 
both expressive and receptive performances were higher for the iconic 
than the arbitrary gesture condition. 

Mollink, Hermans and Knoors (2008) found a positive effect of adding 
signs to words for receptive word learning in hearing impaired children. 
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They analyzed this effect as a function of sign iconicity. Even though the 
effect was the same for signs with strong iconicity than for those with 
weak iconicity one week after training, the results after five weeks show 
that learning was better for strongly iconic signs. The interesting thing is 
that at five weeks, the performance decreased compared to one week only 
for weakly iconic signs but remained the same for strongly iconic signs 
suggesting that iconicity helped longer memory retention. This finding is 
corroborated by that of van Berkel-van Hoof and colleagues (2016) who 
found a positive effect of adding iconic signs for receptive word learning 
in hearing impaired children. Note however that they did not find a bene- 
ficial effect of iconic gestures over none in 10;8-year-old TD children and 
children with SLI. Liike and Ritterfeld (2014) did find a beneficial effect of 
iconic gestures over none in 4;7-year-old children with SLI. 

Capone and McGregor (2005) and Capone Singleton (2012) com- 
pared the effect of using iconic gestures underlying shape to ones under- 
lying function. They found that shape gestures were more efficient in 
promoting expressive and receptive word learning than function gestures. 
This suggests that type of iconicity could be as important as iconicity itself. 
In both studies the shape gestures were static symbols and the function 
gestures were dynamic symbols. Even if performances on words trained 
with a function gesture were generally not better than those for words 
trained with no gesture, the authors put forward an interesting finding. In 
expressive learning testing, when the participants did not manage to pro- 
duce the word, the experimenter provided the gestural cue. In these cases, 
function gestures functioned as good as shape gestures to help the children 
produce the new words upon testing, suggesting that function gestures 
may have a beneficial effect even though they are not as effective as shape 
gestures. Capone Singleton (2012) speaks of a shape bias already put for- 
ward by other researchers (Kemler Nelson et al., 2000; Smith et al., 2002). 
Mumford and Kita (2014) also compared different types of iconic gestures 
for learning new verbs receptively: ones underlying the manner of an ac- 
tion (dynamic) and the other its resulting end state (shape, static). Contrary 
to the above-mentioned studies, the authors found a beneficial effect over 
word only training solely for the dynamic manner gestures and not for the 
static end state gestures. This may be due to a difference in the ages of the 
participants: the participants were aged 41.48 months on average whereas 
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those in the Capone studies were aged around 28 months. Actually, it 
appears that the shape bias mentioned above would wear off with age 
(Imai, Gentner, & Uchida, 1994). Also note that in the Mumford & Kita 
(2014) study, the children learned verbs whereas they learned nouns in 
the Capone studies. Vogt and Kauscke (2017a) compared iconic gestures 
underlying path and/or manner to ones underlying shape. They found 
a larger advantage for path-manner than shape gestures for immediate 
learning of verbs but similar effects for both gesture types for immediate 
learning of nouns. On the other hand, after a two- to three-day delay from 
the end of training, they found a larger advantage for shape than path- 
manner gestures for nouns but similar effects for both gesture types for 
verbs. This result helps understand the differences between the Mumford 
& Kita study and the Capone studies. 

Vogt and Kauscke (2017a) found an advantage of iconic gestures over 
attention-directing gestures (raised forefinger in front of upper body) even 
at the end of the first training session. The authors conclude that: “it is the 
iconicity of the gestures (that is the resemblance to the referent), rather 
than the item-specific encoding of both auditory and visual information to 
a lexical form, that helps learning” (p. 22). O’Neill, Topolovec and Stern- 
Cavalcante (2002) analyzed the generalization of the use of newly learned 
adjectives to qualify other objects than those used during training with 
a similar distinctive quality referred to with the adjective. For example, 
during training, the children were presented with a ‘lumpy cat’ and taught 
the adjective ‘lumpy’. Upon testing they were presented with a ‘lumpy 
turtle’ and a ‘smooth turtle’ and asked to designate the ‘lumpy’ one. During 
the learning phase, some adjectives were learned with a descriptive gesture 
and others with a pointing gesture. In a first experiment, they found no 
difference between the two conditions except for the adjectives describing 
non-visual properties of objects (descriptive gesture advantage) and con- 
cluded that: “gesture may play a more important role in the learning of 
less visually detectable properties” (p. 255). In a second experiment, using 
lower frequency adjectives describing only non-visible properties, they 
found that descriptive gestures were more efficient in promoting learning 
than pointing gestures. An interesting finding is a higher frequency of men- 
tion of nontarget properties (resp. less expression of uncertainty during 
testing) by the participants in the pointing than in the descriptive gesture 
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condition, but only in experiment 1. Capone Singleton (2012) compared 
iconic gestures underlying shape to pointing. She found that shape gestures 
were more efficient in promoting expressive and receptive word learning 
than pointing. Booth, McGregor and Rohlfing (2008) however found that 
the use of pointing by the experimenter during training yielded better 
receptive learning than not using any gesture. It may therefore be the case 
that attention-directing gestures are also beneficial for word learning, but 
less than iconic gestures. 

Overall, the studies reviewed here suggest that different types of manual 
gestures can have differential effects. Pointing gestures appear to be 
helpful for word learning compared to no gesture even though there is 
some evidence that gestures having an iconic resemblance with the ref- 
erent would be more effective. “the use of descriptive gestures during the 
teaching of novel adjective terms appears (...) to have helped children to 
isolate the particular property intended by the speaker in a manner not 
possible when point gestures were used instead” (O’Neill, Topolovec, & 
Stern-Cavalcante, 2002). 

Results of the studies comparing arbitrary and iconic signs do not all 
agree but do suggest in general that iconics are more beneficial. Several 
studies also put forward differences between iconic gestures with a bias of 
shape over function for nouns and path-manner over shape gestures for 
verbs. Capone and McGregor (2005) suggest that the role of gestures is 
to draw “attention to an important aspect of the word learning problem 
(shape, function or both), thereby reinforcing salient semantic content 
of the spoken language” (p. 1478). As suggested by Vogt and Kauschke 
(2017a), iconic gestures may facilitate the association with the lexical 
form. Type of iconic gesture could interact with type of word learned (e.g., 
nouns vs. verbs) and also the word itself. For example, one could hypothe- 
size that some words are better represented by static shape iconic gestures 
(e.g., hands shaped as a round to illustrate the word ‘ball’) and others by 
dynamic function gestures (e.g., fingers miming cutting to illustrate the 
word ‘scissors’). 


3.7. Does testing time matter? 


Some of the studies reviewed tested the participants immediately after 
training, others after various delays and others at both times. Immediate 
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testing tackles fast mapping, the initial stage of word learning “in which a 
first connection of a word and referent is retained (Carey, 2010; Carey & 
Bartlett, 1978)” (Liike & Ritterfeld, 2014, pp. 203-204). Delayed testing 
examines slow mapping, the retention of meaning and label association 
in memory when the “child forms a robust and more sophisticated lex- 
ical representation of the word (Carey, 2010; Horst, Parsons, & Byron, 
2011)” (Like & Ritterfeld, 2014, p. 204). It is possible that a gestural 
effect could appear both at the stages of fast and slow mapping or only 
after some time as observed in other learning tasks (e.g., Cook, Mitchell, 
& Goldin-Meadow, 2008; Cherdieu et al., 2017). 

A total of five studies (iconic gestures: Capone & McGregor, 2005; Liike 
& Ritterfeld, 2014; Mumford & Kita, 2014; Vogt & Kauschke, 2017a; 
pointing: Booth, McGregor, & Rohlfing, 2008) put forward a beneficial 
effect of using iconic gestures to fast map new words receptively but not 
expressively (except Vogt & Kauschke, 2017a) in TD children (no gestural 
benefit: Bird et al., 2000). 

Like and Ritterfeld (2014) found similar results for children with SLI 
(positive effect of gestures on receptive word learning). Bird and colleagues 
(2000) found the same for children with T21 and Kohl, Karlan and Heal 
(1979) for one child with T21 and three children with cerebral palsy. 
Giezen, Baker and Escudero (2013) however found no positive effect of 
using gestures during training for immediate receptive word learning in 
children with CI. Note that they used arbitrary signs whereas Like and 
Ritterfeld (2014) used iconic gestures (see section 3.6. for discussion on 
this topic). Bird and colleagues (2000) also used arbitrary signs but the 
children were exposed five times more to each label than those in the 
Giezen, Baker, & Escudero (2013) study. Finally, note that the testing 
scores in the Giezen, Baker, & Escudero (2013) study were relatively high 
(above 80 % correct responses) suggesting a possible ceiling effect. 

Booth, McGregor and Rohlfing (2008) found a positive effect of 
pointing gestures on receptive word learning in TD children both immedi- 
ately and after a three-to five-day delay. The same was obtained for iconic 
gestures by Capone and McGregor (2005; 11.5-day delay). Kapalkova, 
Polišenská and Siissova (2016) found an effect of testing delay on gen- 
eral expressive word learning, all conditions (picture vs. iconic gesture 
support) put together with no interaction. Note however that all testing 
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sessions were delayed (one day after end of training vs. two weeks and 
six weeks), performances being better after one day than after two or six 
weeks. McGregor and colleagues (2009) however find a larger effect of 
gesture (over speech only) on the receptive acquisition of the preposition 
‘under’ only after two to three days and not at immediate testing and only 
for generalization (not for trained pairs of objects). Note however that in 
all conditions including the speech only condition, the experimenter mod- 
eled the ‘under’ relationship on objects during training. Even though this is 
not a manual gesture per se, it may act as a gesture, which would explain 
the results. The authors also analyzed the correlation between short-term 
and long-term performances and found, only for the gesture group, that 
“children who demonstrate modest gains on the immediate post-test build 
on those gains for a more impressive performance at delayed post-test” 
(p. 819) and this only for unlearned combinations: “The gesture advan- 
tage was revealed by the children’s ability to follow under instructions 
given the untrained generalization items” (p. 820). Like and Ritterfeld 
(2014) found the same result in children with SLI, the positive effect of 
gesture only emerging for expressive (and not receptive) learning after a 
one-week delay. Vogt and Kauschke (2017a) found no effect of condition 
x testing time on performance in expressive and receptive word learning 
in TD and SLI children but this study did not include a ‘no gesture’ con- 
dition, it only compared the use of iconic and attention-directing gestures. 

Van Berkel-van Hoof and colleagues (2016) only used delayed testing 
at several time points and found no advantage for iconic gestures over 
speech in all cases in TD and SLI children even though word learning 
performance improved over time (three testing time points). This improve- 
ment is probably due to the fact that there was extra training between 
testing times. Note that they did, however, find a positive effect of using 
gestures during training in hearing impaired children and that this advan- 
tage increased in magnitude over time. Mollink, Hermans and Knoors 
(2008) also found positive effects of adding signs to words for receptive 
word learning in hearing impaired children one week and five weeks after 
the end of training (no fast mapping testing). In contrast to van Berkel-van 
Hoof and colleagues (2016), they found that instead of increasing, per- 
formance decreased with time. This discrepancy however probably stems 
from the fact that they tested children after one and five weeks whereas 
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van Berkel-van Hoof and colleagues (2016) tested children only one or 
two days after the end of training. 

To summarize, in general, upon immediate testing, there is a gestural 
effect essentially for receptive learning in TD children as well as in chil- 
dren with speech and language impairments. This effect generally holds for 
delayed testing. Some studies however find no immediate advantage but 
do find a gestural advantage upon delayed testing, especially for expres- 
sive learning. Note that Brown and colleagues (2012) as well as McGregor 
(2014) found that performances in recall of newly learned words were 
better after 12 or 24 hours than immediately. 


3.8. Is observing the gesture during training enough 
or does producing the gesture work better? 


This section examines whether simply observing a gesture promotes word 
learning or if also producing it during the training phase could be more 
helpful. Some studies have indeed shown that imitating the gesture during 
training improves the beneficial effect of manual gestures for learning words 
in a foreign language (e.g., Macedonia, Bergmann, & Roithmayr, 2014). 
A total of five studies (Capone & McGregor, 2005; Booth, McGregor, & 
Rohlfing, 2008; McGregor et al., 2009; Like & Ritterfeld, 2014; Mumford 
& Kita, 2014) found a positive effect of adding a gesture during training to 
learn new words through observation of gesture in TD children. The same 
observation was made by Kohl, Karlan and Heal (1979) for one child with 
T21 and three with cerebral palsy, by Liike and Ritterfeld (2014) for chil- 
dren with SLI and by Mollink, Hermans and Knoors (2008) for children 
with hearing impairments. One could therefore conclude that producing 
the gesture during training is not necessary for a positive effect of ges- 
ture on word learning to emerge. Note that in all the above-mentioned 
studies (except Capone & McGregor, 2005 and Mollink, Hermans, & 
Knoors, 2008), the positive effect was only observed for receptive (and 
not expressive) learning. The results of studies in which the participants 
were explicitly asked to or were allowed to imitate the gestures (Bird et al., 
2000; de Nooijer et al., 2014; Kapalkova, PoliSenska, & Siissova, 2016; 
van Berkel-van Hoof et al., 2016) actually do not obtain better results in 
terms of expressive learning. McGregor and colleagues (2009) only found 
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a positive effect of gesture after a two- to three-day delay. Even though 
the children did not imitate the gesture during training, they did enact the 
‘under’ relationships learned on actual objects. This could have a similar 
effect as performing a gesture, which would explain the lack of a ges- 
ture advantage over word-alone training upon immediate testing: the chil- 
dren actually observed something close to a manual gesture even in the 
word only condition. The fact that a positive effect does emerge after delay 
could suggest that actual actions on objects even though positive for word 
learning are not as effective as manual gestures. 

Some interesting observations come from studies in which imitation 
was not forced and which analyzed the correlation between imitation 
and word learning performances. In Bird et al. (2000), the participants 
were not required to, but could, imitate the gestures during training. 
Correlational analysis showed no correlation between imitation vs. none 
and word learning performance for children with T21 but did find mod- 
erate to high correlation for TD children. De Nooijer and colleagues 
(2014) tested receptive verb learning and included both a condition in 
which participants only observed the gesture and one in which they were 
also asked to imitate the gesture. They did not find any advantage of imi- 
tating the gesture rather than just seeing it. 

It could also be the case that producing the gesture upon recall could 
facilitate the latter. Only one study directly controlled for gesture pro- 
duction during testing. O’Neill, Topolovec and Stern-Cavalcante (2002) 
found a positive correlation between descriptive gesture production during 
testing and receptive performance in TD children only for two out of the 
five adjectives learned in experiment 1 but for all adjectives in experi- 
ment 2. Kohl, Karlan and Heal (1979) also controlled for sign produc- 
tion during recall in children with disabilities but did not correlate that 
measure to word learning performance. The same can be said of Romski 
and Ruder (1984) for children with T21 and Capone (2007) for TD chil- 
dren. Mollink, Hermans and Knoors (2008) asked the hearing-impaired 
participants to repeat the word during training but no indication is given 
as to whether they imitated the signs. 

An interesting study relevant to the present question is that of de Nooijer 
and colleagues (2013). It was not included in the review itself because it did 
not directly compare the use of gestures during learning with that of other 
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cues. In their study, participants (N=120; mean age = 10 years) learned 
verbs always associated with a gesture during training. The focus was on 
analyzing the effect of gesture imitation (or not) by participant during 
learning and/or recall. The verbs were of three types: abstract, locomotion 
or object-manipulation. They found a positive effect of imitation (over 
none) only for the object-manipulation verbs. They suggest that gesture 
imitation would be crucial only when the action imitated is goal-directed. 

Imitation of the spoken word also appears to be a crucial factor. Bird 
and colleagues (2000) indeed found significant correlations between spon- 
taneous imitations of the word during learning and the expressive and 
receptive learning performances. There was no analysis of specific gestural 
imitations, which were infrequent in the TD participants. 

The study by de Nooijer and colleagues (2014) included an extra inter- 
esting condition in which they asked the children to invent a gesture for 
the word. They found that the children invented gestures very similar to 
those the experimenters had invented for the other experimental conditions 
(sometimes simplified). However, no clear positive or negative effect of this 
condition over the others (no gesture or gesture from the experimenter) 
emerged. 

In a nutshell, producing the manual gesture during training does not ap- 
pear to be absolutely necessary since several studies find a gestural advan- 
tage even if the children only observed the gesture during training. A few, 
but not all, studies however showed a positive correlation between pro- 
duction of the manual gesture during training and better word learning 
performances. Results are still too sparse to draw a strong conclusion. 
One study also showed that producing the gesture during testing results in 
better word recall. 


3.9. Analyzing generalization 


Another interesting aspect of learning is generalization. Once the words 
are learned in one context, it is indeed important to be able to generalize 
their use to other contexts. Category generalization corresponds to the 
capacity to be able to use nouns labeling objects for other objects of the 
same category (for example, being able to use the noun for the same object 
in a different color than during training). For action verbs, generalization 
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corresponds to the ability to use the verb to describe a similar action 
in another context (for example, being able to use the verb to describe 
the action but performed on different objects than during training). For 
adjectives, generalization is the ability to use the newly learned word to 
qualify a different object having the same property than that designated by 
the learned adjective. 

Booth, McGregor and Rohlfing (2008) found that TD children managed 
to receptively generalize the use of new words to objects from the same cat- 
egory more when the words were trained with a pointing gesture. Capone 
Singleton (2012) obtained the same results for iconic gestures underlying 
shape, both receptively and expressively. McGregor and colleagues (2009) 
found the same but only receptively and only after a two- to three- day 
post-training delay. In the two latter studies, the participants learned nouns 
for objects in three conditions: with an iconic gesture depicting the object’s 
shape, with one illustrating the object’s function or with none. They were 
then tested on the generalization of the use of the noun for another object 
of the same category (similar shape and same function). They did this 
more efficiently when they had learned the noun with a shape gesture than 
with no gesture (not true for function gestures, see above for discussion on 
this potential shape bias). 

O’Neill, Topolovec and Stern-Cavalcante (2002) found that descrip- 
tive gestures were more efficient then pointing to receptively generalize 
the use of adjectives to new objects by TD children. In Experiment 2, the 
adjectives described non-visible properties of the insides of objects and 
participants were asked to generalize the adjectives to objects of different 
shapes and colors but with the same non-visible property. 

Romski and Ruder (1984) found that signs helped children with 
T21 receptively generalize the use of new verbs and nouns but impeded 
expressive generalization. Participants learned verb/noun combinations 
describing actions performed on objects and were then tested on how they 
generalized the use of the nouns and verbs when they were combined dif- 
ferently than during training. 

Mumford and Kita (2014) compared the effect of adding manner or 
end-state gestures to verbs during learning with a control condition with 
no gesture. The participants were then tested receptively on their gener- 
alization performances. Children were taught a verb alongside a gesture 
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focusing on the manner of the action or one focusing on the end-state 
or none. They were then asked to select from two videos which one 
corresponded to the verb label. The materials were different but one of 
the videos displayed the same manner as during training and the other one 
the same end-state. They found that participants generalized more often 
based on manner when they had seen a manner gesture during training 
(no preference bias for end-state gestures and no gestures). The authors 
draw several conclusions from their findings: 1. the fact that the results are 
different for manner and end-state gestures suggests that the “gestural con- 
tent” plays a role; 2. the fact that children manage to generalize suggests 
that “gestures do not simply help children to associate a word with a 
scene in general”; 3. “iconic gestures provide a sketch of abstract semantic 
representations of verbs, which help children carry out fast mapping (...) 
of newly encountered verbs and correctly apply the verbs to novel complex 
scenes”. 

To summarize, several studies find that manual gestures play a posi- 
tive role in generalizing the use of newly learned words to new contexts. 
McGregor and colleagues (2009) suggest that: “gesture input promoted 
more robust knowledge of the meaning of under, knowledge that was less 
tied to contextual familiarity and more prone to consolidation” (p. 824). 


4. Summary and explanatory hypotheses for the 
role of manual gestures in word learning 


This chapter reviewed a total of 19 articles describing 20 experimental 
studies examining the effect of using manual gestures during training on 
word learning. Even though this was not a criterion for selection, they 
all tested children of various ages who were either typically developing 
(TD) or with speech and language difficulties from various origins: hearing 
impairment (HI), specific language impairment (SLI), trisomy 21 (T21), 
cerebral palsy (CP). Based on these studies and their results, several 
research questions were addressed in order to try and better understand 
the potential role of gestures in word learning. 

The first question (section 3.2.) was very general and asked whether 
manual gestures, whichever their type, actually improved word learning 
performances in TD children. Five out of eight studies involving a total of 
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212 children and directly comparing a word + gesture condition to a word 
only condition found better word learning performances in the word + ges- 
ture condition. Two studies, testing a total of 29 children, found no differ- 
ence between conditions. The results of all the studies reviewed therefore 
suggest that supplementing a word with a manual gesture during training 
is beneficial to word learning (see section 3.2. for more details). 

Section 3.3. examined the same research question in children with 
speech and language impairments. Six out of seven studies involving 74 
children and directly comparing a word + gesture condition to a word only 
condition found better word learning performances in the word + gesture 
condition for at least one group of children with speech and language 
impairments. Two studies, testing a total of 25 children, found no dif- 
ference between conditions. In these two latter studies, the children were 
not disadvantaged by manual gestures either. All this suggests that using 
manual gestures to teach new words to children with speech and language 
impairments could be useful. The studies reviewed indeed show that using 
gestures at worst does not have a positive effect and at the best promotes 
word learning, but in no cases impedes it. The evidence reviewed here 
actually suggests that children with speech and language impairments 
would benefit even more from manual gestures than TD children (see sec- 
tion 3.3. for more details). This could have implications for speech and 
language intervention. 

Section 3.4. tackled the question of the specificity of gestures over other 
additional cues for word learning. The results of the studies comparing the 
use of manual gestures, be it pointing or iconic representational gestures, to 
other cues, such as pictures, all suggest that there is a specificity of manual 
gestures which promote word learning more than other additional cues. 

Section 3.5. further examined the potentially different effects manual 
gestures could have on expressive vs. receptive learning. Expressive 
learning refers to the ability to produce the word when asked and receptive 
learning refers to the ability to comprehend the word after training. In gen- 
eral, performances in receptive learning are better than those in expressive 
learning. Receptive learning appears to be positively influenced by manual 
gestures as long as not too many words are taught. Expressive learning was 
shown to take longer than receptive learning and requires more training 
sessions. A manual gesture advantage thus only appears when a sufficient 
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number of training sessions is used. Finally, the studies reviewed suggest 
that the positive effect of manual gestures on expressive word learning is 
greater for children with speech and language impairments even though it 
is also true that these studies generally used more training sessions than 
those involving TD children (see section 3.5. for details). 

Section 3.6. aimed at analyzing whether different types of gestures had 
different effects. The studies reviewed suggest there are differences between 
manual gestures of different types in terms of effect on word learning. 
Pointing appears to be helpful for word learning even if the positive effect 
would be weaker than for iconic gestures. Even if the results do not all 
agree, iconics would be more beneficial than arbitrary signs even if the 
latter do also have a positive effect. It also appears that type of iconicity 
(e.g., underlining shape vs. function) would have differential effects even 
though this appears to interact with the type of words learned (e.g., nouns 
vs. verbs) (see section 3.6. for more details). 

Section 3.7. tackled the question of testing time and the potential 
differences between gestural effects for immediate vs. delayed testing. In 
immediate testing, a positive effect of manual gesture most often appears 
only for receptive learning in TD children as well as in children with 
speech and language impairments. This beneficial effect generally holds 
in time and still can be observed upon delayed testing. The positive effect 
sometimes appears only after a certain delay (at least one night) especially 
for expressive learning (see section 3.7. for details). 

Section 3.8. examined whether actually producing the gesture, vs. just 
observing it, during training yielded different results. The results of the studies 
reviewed suggest that gesture production during training is not mandatory 
for a beneficial effect of manual gestures to appear. Some results however 
suggest that producing the gesture during training would result in a greater 
positive effect of manual gestures. Production of the gesture during testing 
could also yield better recall performances (see section 3.8. for details). 

Finally, section 3.9. analyzed potential effects of manual gestures on 
generalization performances. Some results suggest that manual gestures 
could enhance generalization of the newly learned words to other contexts. 

In the following, we will address different hypotheses to explain why 
gestures would play a beneficial role in word learning based both on per- 
sonal thoughts and on those proposed by other authors. 
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4.1. Do gestures simply function as attention 
attractors to the word learning context? 


A hypothesis to explain why gestures facilitate word learning could 
be that using a gesture during training would function as an attention 
getter: the gesture would help focus the learner’s attention on the object 
of the training, i.e. the word pronounced by the experimenter. Joint atten- 
tion (Tomasello, Carpenter, & Liszkowski, 2007), which is important for 
learning, would be enhanced by the gesture. If this is the case, one could 
expect that gestures would not facilitate learning more than any other 
means of attracting attention. Some of the studies reviewed in this chapter 
provide information relative to this question. 

Booth, McGregor and Rohlfing (2008) compared the use of pointing 
towards the labeled object and the use of gaze towards the latter. They 
found that learning performances are better in the pointing than in the 
gaze condition. They also controlled for child attention by analyzing 
looking time of the participants when the experimenter labeled the object 
during learning. They found that the participants’ looking time towards 
the target object did not differ between conditions. This suggests that the 
level of attention was the same across conditions and that, even so, the 
use of pointing enhanced receptive word learning. The authors conclude 
that: “socio-pragmatic factors come to play a larger role than perceptual- 
attentional factors in word learning by the time the children reach 2 1⁄2 
years of age” (p. 198). This suggests that there is something more to 
manual gestures than just drawing attention to the word learning context. 
Note however that O’Neill, Topolovec and Stern-Cavalcante (2002) found 
a percentage of correct responses in the pointing condition very close to 
chance suggesting that learning did not occur in the pointing gesture con- 
dition whereas it did in the descriptive gesture condition in Experiment 
1. Performances were however higher than chance in the pointing condi- 
tion in Experiment 2. 

Bird and colleagues (2000) did not find better expressive and receptive 
word learning performances when words were associated with an arbi- 
trary gesture than when they were associated with no gesture. If gestures 
solely functioned as attention getters, arbitrary gestures should also attract 
the participants’ attention. 
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Kapalkova, Poligenska and Siissova (2016) found that providing a manual 
iconic gesture during training yielded better expressive learning performances 
compared to providing a picture. McGregor and colleagues (2009) also found 
that an iconic gesture was more efficient than a photograph for learning the 
meaning of the preposition ‘under’. It however seems that gesture and picture/ 
photographs should equally function as attention getters. Capone Singleton 
(2012) compared the effect of shape gestures, function gestures and pointing 
gestures and found that shape gestures yielded better expressive learning and 
category generalization performances than function and pointing gestures. 
Even if one could argue that a representational gesture could function as a 
stronger attention getter than a pointing gesture, there is no reason why a 
representational gesture underlining shape would focus attention more than 
one underlying function. The authors suggest that “Whereas both pointing 
and iconic gestures can draw attention to an abject, the iconic gesture may 
also orient children to attend to or strengthen their inferences about specific 
features and their connection to the word label” (p. 289-290). 

Vogt and Kauschke (2017a) found a positive effect of iconic gestures 
over an attention directing gesture (raised forefinger in front of upper 
body) both for fast and slow mapping expressive and receptive learning 
of new words by children with SLI as well as age-matched and language- 
matched TD children: “iconic gestures provide an advantage over and 
above focusing children’s attention” (p. 21). Note that receptive and 
expressive learning also occurred in the attention-directing gesture condi- 
tion even though to a lesser extent than in the iconic gesture condition. The 
advantage of iconics vs. the attention-directing gesture further depended 
on testing time and word type: it was only significant for verbs for fast 
mapping and for nouns for slow mapping. 

Put together, all this evidence suggests that either gestures have an addi- 
tional role than just attracting the learner’s attention, even more so for iconic 
representational gestures, or that gestures attract attention more than other 
cues as hypothesized by McGregor and colleagues (2009): “gestures are 
interesting, and thus draw more attention to moments of training” (p. 822). 


4.2. Memory enhancement 


Manual gestures could function as additional traces to help the learner 
memorize the new words more efficiently. Several of the studies reviewed 
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in this chapter provide insight on this hypothesis. McGregor and colleagues 
(2009) put forward a gestural advantage only after two- to three-days 
and not immediately. Like and Ritterfeld (2014) also found a gestural 
advantage only after a one-week delay and not immediately for expres- 
sive learning. Finally van Berkel-van Hoof and colleagues (2016) found 
that the gestural advantage grew over time. All this evidence suggests that 
gestures help learners memorize new words longer than when the words 
are trained without a gesture. 

Mollink, Hermans and Knoors (2008) compared word learning 
performances after one and five weeks. They found a gestural advantage 
but no differential effect between strong and weak iconicity signs after one 
week. After five weeks, performance decreased but only for weak iconicity 
gestures and not for strong iconicity ones. This suggests that iconicity 
favors longer memorization. 

Studies on word and sentence list recall also provide information on the 
question. In these studies, participants are generally provided with a list 
of words or sentences that they are instructed to try to memorize. Cohen 
and Otterbein (1992) showed that pantomimic and non-pantomimic 
gestures both favored the recall of sentence lists when these did not form 
a narrative. Feyereisen (2006) found a similar result for representationals 
and non-representationals and Thompson (1995) for iconics. Igualada, 
Esteve-Gibert and Prieto (2017) showed that three-to five-year-old chil- 
dren recalled words better when these were presented with beat gestures 
than when they were presented alone. All these studies suggest that when 
manual gestures are present during encoding of the memory, it is better 
encoded. The same phenomenon could be at work for word learning. 


4.3. Cognitive load minimization 


Manual gestures may also minimize the cognitive load involved in word 
learning as suggested by Goldin-Meadow and colleagues (2001). This 
study did not directly tackle word learning. Participants first viewed a 
sequence of letters they were asked to memorize, they then had to solve a 
math problem after what they were asked to recall the sequence of letters. 
Those who could gesture during math problem solving recalled more let- 
ters than those who were prevented from gesturing. The authors suggest 
that gesturing lessened cognitive load during problem solving freeing space 
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for memorizing the sequence of letters. McGregor and colleagues (2009) 
suggest that gesture “externalized a meaningful aspect of the referent in 
the visual world. By making that meaning more obvious, gesture may free 
cognitive-linguistic resources for processing the word itself and, perhaps 
the other lexical and syntactic elements involved” (p. 823). Gestures could 
help reinforce the link between the lexical form and the concept it refers to 
by putting forward a distinctive property of the meaning the word refers 
to, depicting it and attracting the learner’s attention to it. Illustrating this 
link could free part of the cognitive load involved in finding this property 
and processing this link. This could free more cognitive load to actually 
learn the lexical form associated with it. 


5. Conclusion 


This review of experimental evidence of the role of manual gestures for 
word learning shows that even if there is not a gestural advantage in 
all studies, the majority of them show that words are learned more effi- 
ciently when they are associated with a manual gesture during training 
in typically developing children. This effect appears to be even stronger 
in children with speech and language impairments. Manual gestures are 
more efficient than other additional cues such as pictures. Even if manual 
gestures appear to play a positive role for both expressive and receptive 
word learning, receptive learning is faster. The gestural advantage seems to 
be present for different types of manual gestures, it appears to be strongest 
for gestures bearing a physical resemblance with the referent. Some studies 
find a gestural advantage even immediately after training while others find 
this advantage only after a delay. 

The evidence suggests that manual gestures play a deeper role than just 
attracting the learner’s attention. Gestures could facilitate word learning 
by enhancing memorization and/or alleviating cognitive load. 
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Interference in memory consolidation of 
non-native speech sounds 


Abstract: For several decades, researchers have been investigating the challenges 
of and constraints on learning the speech sound inventory of a second language in 
adulthood. Commonalities that have emerged from these findings include the im- 
mense individual variability reported in non-native speech sound learning studies 
and the rare attainment of native-like proficiency in perception or production of 
second language speech sounds. While numerous studies have shed light on various 
aspects of this challenging process, many questions about the extent and nature of 
these difficulties remain. A nascent line of research suggests that some of the diffi- 
culty in non-native speech sound learning could be attributed to various sources of 
interference that disrupt the memory consolidation process, thus interfering with 
the retention of learned phonetic information. It is well documented in the broader 
learning literature that interference from competing stimuli or subsequently learned 
skills can disrupt memory consolidation processes. However, this phenomenon has 
received little attention in the speech literature, and the potential sources of inter- 
ference in the speech domain have yet to be identified. In this review, I discuss how 
integrating theories of memory consolidation with non-native speech sound learning 
models can more accurately capture patterns of learning observed in the non-native 
speech sound learning literature, specifically patterns showing failures of memory 
consolidation due to interference. 


Keywords: sleep, second language learning, adults, memory consolidation, non- 
native speech sounds 


1. Introduction 


Adult second language learners face many challenges, especially when 
attempting to master the speech sounds of a non-native language. Although 
a second language learner must gain proficiency in a number of linguistic 
domains (e.g., morphology, syntax, semantics), acquiring perceptual sen- 
sitivity to the speech sounds of another language is an important step 
in language acquisition. Indeed, several studies support the notion that 
speech perception abilities can facilitate higher levels of language learning 
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(e.g., lexical acquisition). For example, native speech perceptual abili- 
ties in infancy have been shown to predict language development in early 
childhood (Tsao et al., 2004; Kuhl et al., 2008), and in adulthood, supe- 
rior perceptual discrimination of non-native speech contrasts can facil- 
itate learning of lexical items that contain those contrasts (Silbert et al., 
2015). Thus, mastery of the speech sound inventory of a language may 
be a crucial step in language acquisition more generally. Unfortunately, 
however, most studies of second language learners report rather poor per- 
ceptual abilities for difficult phonetic contrasts (e.g., Bradlow et al., 1999; 
Flege, 2003), even when learning commenced in childhood (Pallier et al., 
1997). As such, several types of training paradigms have been devised 
in an attempt to optimize non-native speech sound learning, specifically 
for difficult speech sound contrasts. Paradigms that have been employed 
in training studies differ in terms of whether learning takes place in an 
implicit (Lim & Holt, 2011; Vlahou et al., 2012; Wade & Holt, 2005) or 
explicit manner (e.g., Earle et al., 2017; Earle & Myers 2015a, 2015b), 
or whether participants were trained on tokens with limited variability 
or high variability (e.g., tokens produced by multiple talkers or occurring 
in multiple phonological contexts, Logan et al., 1991; Lively et al., 1993, 
1994; Bradlow et al., 1997, 1999). Other paradigms utilized either nat- 
ural speech or exaggerated versions of speech to make differences between 
stimuli more salient (McCandliss et al., 2002; Golestani & Zatorre, 2004; 
Swan & Myers, 2013). Although each of these training paradigms has 
indeed demonstrated learning, the majority of these studies report only 
moderate success, and most second language learners ultimately fail to 
attain native-like perception or production of non-native speech sounds 
(e.g., Bradlow et al., 1999; Piske et al., 2001; MacKay et al., 2001). 

Most studies probing plasticity in the speech system focus on the ini- 
tial processes of learning speech sounds; however, language acquisition 
involves more than the initial encoding of stimuli. In order for an indi- 
vidual to develop and maintain language proficiency, critical aspects of 
a language, such as phonemes and lexical items, need to be consolidated 
into long-term memory for later retrieval. Stable long-term memory 
representations may then facilitate retention and generalization of learned 
speech sounds. Until recently, the role of memory in speech sound acqui- 
sition had been underexplored, and therefore, many questions about the 
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memory functions underlying this process remain. Due to the general lack 
of success in non-native speech sound learning, it is becoming more ap- 
parent that consistent failures of memory consolidation may underlie the 
challenge of learning non-native speech sounds, and this merits further 
exploration. More specifically, training conditions that facilitate stronger 
initial learning of speech sounds and limit exposure to interference after 
training and before consolidation takes place may be key factors in pro- 
moting long-term changes in the speech system. The current chapter begins 
with a review of models of non-native speech sound learning, followed by 
a discussion of memory consolidation theories and experimental evidence 
for those. In light of this evidence, I conclude with an alternative interpre- 
tation of some work on non-native speech sound learning and discuss how 
considering strength of learning and interference in memory consolidation 
may have explanatory power for some of the challenges reported in these 
studies. 


2. Constraints on non-native speech sound learning: 
Perceptual similarity of first language speech sounds 


By the first few months of life, infants demonstrate perceptual sensitivity 
to many different speech sounds that are found in world languages (Eimas 
et al., 1971). However, by the end of the first year of life, infants lose the 
ability to discriminate certain phonetic contrasts that do not occur in the 
ambient language (e.g., Werker & Tees, 1984; Kuhl et al., 2006). While 
this warping of perceptual space appears to facilitate language acquisition 
in early childhood (Tsao et al., 2004; Kuhl et al., 2008), the process of per- 
ceptual reorganization can greatly constrain the acquisition of non-native 
speech categories later in life. In particular, perceptual similarity of native 
and non-native speech sounds can cause non-native speech sounds to be 
more difficult to perceive as distinct categories (e.g., Best et al., 2001). 
For example, native English speakers often struggle to perceptually dis- 
tinguish dental and retroflex voiced stop consonants found in Hindi. The 
perceptual similarity of these sounds to the English alveolar /d/ category 
and their similarity to each other make these exceptionally challenging 
for learners to disambiguate. Indeed, several models of non-native speech 
sound learning account for such difficulties by considering the relationship 
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between native and non-native speech sounds. For example, the perceptual 
assimilation model (e.g., Best, 1994; Best & Tyler, 2007) and the native 
language magnet model (Kuhl, 1994; Kuhl et al., 2008) both posit that 
non-native speech sounds that are perceptually similar to native phonemes 
will be harder to perceive than perceptually dissimilar sounds. However, 
the two models differ on which dimensions of the speech signal are con- 
sidered important for perception. The perceptual assimilation model 
focuses on naive listeners’ perception of non-native speech sounds as a 
result of articulatory similarity between native and non-native speech 
sounds (Best, 1994; Best & Tyler, 2007). Specifically, this model predicts 
that naive listeners will assimilate unfamiliar non-native speech sounds 
to the perceptual category in the native language that is produced by the 
most similar articulatory gesture (Best & Tyler, 2007). For example, native 
English speakers often map dental and retroflex voiced stop consonants 
found in Hindi onto the alveolar /d/ sound found in English. As a result, 
certain speech sounds are more difficult to perceive, while speech sounds 
without a similar native language category are perceived more easily (Best 
et al., 2001). For instance, English contains no speech categories similar 
to click sounds found in Zulu, and native English speakers typically dis- 
criminate these sounds accurately (Best et al., 1988). In contrast to artic- 
ulatory gestures, the critical dimension of the native language magnet 
model is acoustic space. This model concentrates on the developmental 
processes underlying the acquisition of native speech sounds (Kuhl, 1994; 
Kuhl et al., 2008) and postulates that infants take advantage of statis- 
tical learning in order to acquire the speech sound categories of their na- 
tive language. After sufficient exposure, infants’ perceptual space becomes 
“warped,” and prototypes for native language speech categories emerge. 
These prototypes act as magnets that attract perceptually similar speech 
sounds. Analogous to the perceptual assimilation model, this magnet effect 
would result in some non-native speech sounds (i.e., perceptually similar 
sounds) being more difficult to learn than others (perceptually dissimilar 
sounds). While similar in some ways to the perceptual assimilation model 
and the native language magnet model, Flege’s (1995) speech learning 
model takes a slightly different approach. First, this model focuses on 
experienced adult second language learners and proposes that difficulties 
in second language speech production stem from perceptual obstacles. In 
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other words, one cannot produce what one cannot perceive. An additional 
component of this model predicts that non-native speech sound learning 
becomes more difficult over the lifespan; however, it deviates from tra- 
ditional critical period hypotheses, as it predicts a gradual decrease in 
non-native speech production abilities over the lifespan, rather than an 
abrupt decline after puberty. While this model emphasizes adult second 
language learners’ speech production abilities, it is similar to the models 
described above in that it attributes difficulties with the second language 
sound system to similarities with the native language. 

Attention to dimension models add to these models by clarifying the 
processes required for acquiring non-native speech categories. Attention 
to dimension models propose that a speaker of a language has learned to 
direct attention to relevant parts of the acoustic signal for his or her na- 
tive language and that learning new speech categories requires a learner 
to attend to previously unattended dimensions of the signal (Francis & 
Nusbaum, 2002). Specifically, native speakers of a language have learned 
to direct attention to relevant acoustic cues in the speech signal, and they 
have simultaneously learned to ignore other cues. Learning to reweight 
acoustic cues or to direct attention to different parts of the speech signal 
presents a challenge for learners. This model differs from the others 
described above in that it explains how a listener’s perceptual space 
changes as a result of experience with the native language (i.e., via selec- 
tive attention to meaningful features). 

While each of these models focuses on distinct phases of the acquisi- 
tion process (e.g., infants, naive listeners, or experienced second language 
learners) and accounts for challenges in different ways (e.g., development, 
articulatory representations, production, or dimensions of the acoustic 
signal), some similarities emerge from a close comparison of them. For 
example, these models do not necessarily attribute poor second language 
perception and production abilities to a critical period or a loss of neural 
plasticity; rather, difficulties are attributed to prior experience with the first 
language and how that experience shapes perception and production of 
new speech categories. Clearly, the native language exerts constraints on 
perception and production of second language speech sounds, and these 
constraints are especially strong when speech sounds in the native and 
non-native languages are close in perceptual space. These models certainly 
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capture a great deal of the non-native speech sound learning process, espe- 
cially the relative difficulty of different speech sounds as a result of native 
language background. However, many non-native speech training studies 
observe considerable variability even among individuals of the same 
language background (e.g., Golestani & Zatorre, 2004, 2009; Myers & 
Swan, 2012; Yi et al., 2014), which suggests that factors beyond the native 
language constrain this process. 


3. Memory consolidation in speech and language learning 


To account for individual variability in speech sound learning, it may be 
helpful to consider how models of memory consolidation predict reten- 
tion, forgetting, or elaboration of learned perceptual information. Most 
models of memory consolidation posit that two separate memory systems 
work in tandem to consolidate different types of learning. Specifically, the 
fast, hippocampus-mediated and often sleep-dependent system serves to 
consolidate declarative or explicit memories in order to integrate them 
into existing networks of knowledge, while the slow, hippocampus- 
independent system associated with implicit or non-declarative learning 
induces local changes to neuronal circuitry as a result of continued expe- 
rience and does not typically rely on sleep (e.g., complementary learning 
systems, McClelland et al., 1995; McClelland, 1998; see also Marshall & 
Born, 2007; Dudai, 2004). With extensive exposure over longer periods 
of time, the slow, procedural learning system is able to discover regulari- 
ties in the input from the environment. As a result, this memory system is 
able to sort information into categories, and this knowledge about catego- 
ries can aid generalization to new contexts or exemplars (see McClelland 
et al., 1995 for discussion). In connectionist models, rapid sequential 
learning (i.e., learning one task immediately after another), leads to what 
is called catastrophic interference (McCloskey & Cohen, 1989). In other 
words, networks must completely forget or overwrite the information they 
have learned in order to accommodate new information. According to 
McClelland (1998), catastrophic interference underscores the need for the 
rapid, hippocampus-mediated consolidation system: the hippocampus acts 
as a temporary memory store and allows for memory traces to be selec- 
tively consolidated into long-term memory and integrated into existing 
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networks of information. Crucially, this process of selective consolidation 
via the hippocampus does not result in information being overwritten. 

A rich literature on memory consolidation suggests that sleep can 
improve performance on a task even in the absence of further practice 
through a period of off-line consolidation (Stickgold et al., 2000; see 
Marshall & Born, 2007 for review). Sleep has been shown to facilitate 
abstraction of information from episodic traces, and this integration of 
information into existing knowledge networks allows learning to gener- 
alize to novel contexts (Davis et al., 2009). Sleep-mediated consolidation 
has also been shown to help protect newly learned information from inter- 
ference (e.g., Ellenbogen et al., 2006, 2009; Drosopoulos et al., 2007) or 
recover learning that decayed throughout the course of a day (Fenn et al., 
2003). Several studies have additionally investigated the contributions of 
sleep-dependent memory consolidation to word learning (e.g., Tamminen 
& Gaskell, 2013; Dumay & Gaskell, 2007; Davis et al., 2009). Consistent 
with McClelland’s complementary learning systems account, a study by 
Davis et al. (2009) found that newly learned words become integrated 
into the existing lexicon after a period of sleep. In this study, participants 
learned novel words that overlapped with existing lexical items by several 
phonemes (e.g., cathedruke-cathedral), and these novel words only showed 
evidence of lexical competition with existing words (e.g., cathedral) on a 
lexical decision task after a period of sleep had occurred. Additionally, 
brain activation measured by functional magnetic resonance imaging 
(fMRI) in this study revealed activation of the hippocampus in response 
to the novel lexical items before sleep and cortical activation after sleep. 
This suggests that the newly learned lexical items were temporarily stored 
in the hippocampus and became re-represented in cortical areas after 
sleep, which is in line with predictions from McClelland’s complementary 
learning systems account. In addition to offline gains in the absence of fur- 
ther practice, sleep between two periods of practice in word learning can 
benefit long-term retention of novel words (Mazza et al., 2016), and even 
daytime naps or short periods of sleep can improve or stabilize learning 
of tasks, including word learning (Lahl et al., 2008; Heim et al., 2017). 
The extant literature shows clear benefits of sleep for word learning; how- 
ever, of interest to the current chapter is whether similar findings are also 
observed in speech learning. 
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Indeed, some studies have found benefits of sleep for speech or auditory 
learning tasks (see Earle & Myers, 2014 for review). For example, Fenn 
et al. (2003) carried out an experiment in which participants learned to 
understand computer-synthesized versions of native language words. In 
this study, participants who slept between training and test improved in 
the absence of further practice. Moreover, a degradation of performance 
was observed for participants who were trained in the morning, but this 
loss was recovered following sleep. Thus, sleep may help recover perfor- 
mance that has degraded throughout the course of a day, as well as facil- 
itate generalization to novel contexts. In fact, a later study by the same 
group suggests that sleep promoted generalization of information learned 
with a large amount of variability, but not when learning took place with 
a closed set of tokens (Fenn et al., 2013). In similar fashion, Xie et al. 
(2017) tested listeners on generalization of learning of one Mandarin- 
accented talker to a novel Mandarin-accented talker. Like the Fenn et al. 
(2003) study, Xie and colleagues (2017) found that performance on the 
untrained talker degraded throughout the day for a group trained in the 
morning. Thus, sleep was necessary for generalization to a novel talker 
and was not simply observed with the passage of time. Although these 
studies have found performance gains following sleep, other work on sleep 
in auditory learning and perceptual learning of speech have not observed 
any added benefits of sleep. For example, Roth and colleagues (2005) 
found that sleep was not necessary for improvement on a speech in noise 
task, but rather, the passage of time was sufficient to induce performance 
gains. Similarly, Eisner and McQueen (2006) found no additional benefit 
of sleep for perceptual learning of speech. In a lexically-guided percep- 
tual learning task, participants were exposed to lexical items containing 
non-canonical productions (an ambiguous fricative between /f/ and /s/) 
of certain speech sounds embedded in a lexical context that served to dis- 
ambiguate the speech sound. After exposure, participants were asked to 
categorize tokens on a non-word continuum from /f/-/s/. In this paradigm, 
a shift in the category boundary, (i.e., categorization of more ambiguous 
tokens in a non-word context consistent with the lexical bias in the expo- 
sure condition) indicates perceptual learning. Indeed, participants in this 
study showed a category boundary shift consistent with their exposure 
condition both immediately after training and 12 hours later. Interestingly, 


Interference in memory consolidation 215 


a group that slept during the 12-hour interval showed no greater learning 
effect than a group that remained awake during the day throughout the 
post-training interval. 

Several differences could explain the discrepancies in the findings of 
these studies. First, different learning systems may underlie the various 
tasks employed in these studies. For example, the Eisner and McQueen 
(2006) study measured adaptation to episodic representations of talker- 
specific idiosyncrasies in speech production (that is, how that talker pro- 
duced a particular speech sound in a non-standard way). Recent evidence 
suggests that adaptation to talker-specific, episodic information (i.e., 
details of a specific talker’s voice) occurs rapidly and is stable over time, 
while tasks involving more abstract representations may emerge only 
after consolidation (Brown & Gaskell, 2014). This could explain why 
both groups in the study by Eisner and McQueen (2006) remained stable 
over time, regardless of sleep. The fact that participants in the study by 
Fenn and colleagues (2003) were trained and tested on completely dif- 
ferent words may explain why these participants benefitted from sleep. 
Abstract representations facilitate generalization to novel contexts (in this 
case, generalizing knowledge of synthetic speech sounds to new lexical 
contexts), and because abstract representations manifest only after offline 
consolidation, improvement after sleep in this context is not surprising. 
However, studies by Earle and Myers (2015a) and Earle et al. (2017) have 
found overnight improvement on tasks even when no generalization was 
needed (i.e., the same tokens were used in training and testing). 

Sleep may be especially advantageous when learning involves the forma- 
tion of new representations, rather than adapting existing representations 
to accommodate atypical exemplars. For instance, several recent studies 
have found benefits of sleep in non-native speech sound learning. Earle 
and Myers (2015b) trained participants to learn the Hindi dental/retroflex 
contrast on a closed set of tokens but found that sleep enabled generaliza- 
tion to stimuli produced by novel talkers. Earle et al. (2017) even found 
that duration of sleep predicted overnight improvement on the tasks used 
to assess non-native speech sound learning. Specifically, they found that 
the amount of slow wave sleep predicted overnight gains on identification 
of the speech sounds, while total sleep duration predicted participants’ 
ability to discriminate the speech sounds. Importantly, this study suggests 


216 Pamela Fuhrmeister 


that individual differences in sleep duration may account for some of the 
individual variability typically observed in training studies. On the other 
hand, an additional study by this group found some surprising limits to the 
benefits of sleep consolidation (Earle & Myers, 2015a). This study likewise 
trained participants on the Hindi dental/retroflex contrast and observed 
improvement after sleep, but this advantage only held if participants 
had been trained in the evening. This study consisted of two groups of 
participants who were trained on the contrast. One group was trained 
in the morning hours and one in the evening hours. Each group returned 
approximately 12 and 24 hours later for reassessment. Surprisingly, only 
the evening-trained participants showed improvement following an over- 
night interval. The authors reasoned that this discrepancy could be a result 
of interference from native language exposure. Specifically, participants 
trained in the morning had a day’s worth of input from their native 
language prior to sleep, while the evening-trained group presumably had 
much less. Subsequent experiments in this study indicated that the lack of 
overnight improvement seen for the morning trained group stemmed from 
exposure to perceptually similar native language speech sounds. 

Other recent studies have similarly found that exposure to certain 
stimuli or engagement in certain tasks can interfere with learning or 
consolidation of newly acquired skills or representations. For example, 
alternating perceptual training with speech production practice has been 
shown to attenuate or interfere with learning. In a study by Baese-Berk and 
Samuel (2016), native Spanish-speaking participants were trained to learn 
a difficult, non-native Basque speech contrast in the laboratory. Training 
consisted of an ABX discrimination task in which participants heard three 
sounds and were asked to indicate whether the third was more similar to 
the first or second sound presented. For this study, some participants were 
asked to repeat the third sound presented on each trial out loud before 
indicating their decision. Surprisingly, the group that repeated the trained 
sound showed very little learning of the contrast on a perceptual post- 
training assessment. Furthermore, production of any sounds, even sounds 
that were dissimilar to the trained contrast, seemed to attenuate learning. 
This suggests that the cause of this interference was not solely a result 
of exposure to the participants’ own poor productions of the non-native 
contrast. Rather, it raises questions about the mechanisms underlying 
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speech perception and production in the development of speech sound 
representations. 

Exposure to phonological variability before or after training may addi- 
tionally attenuate learning and disrupt consolidation. For example, a 
study by Fuhrmeister and Myers (2017) examined whether native English- 
speaking participants trained on a non-native, Hindi dental/retroflex con- 
trast would benefit from additional exposure to the contrast in a different 
vowel context. In this study, one group of participants heard the contrast 
in only one vowel context in minimal pair non-words (/dug/ and /dug/) 
throughout training and testing. Another group heard the contrast in two 
different vowel contexts in assessments (/dug/ and /dug/ vs. /dig/ and /dig/; 
assessments consisted of a pretest, immediate posttest, and a delayed post- 
test), but they had identical training to the other group (i.e., /dug/ and /qug/ 
only). Notably, the participants who were exposed to the Hindi sounds in 
two vowel contexts performed significantly worse than the group exposed 
to one vowel context only on the tasks involving the contrast in the 
trained vowel context, despite having more total exposure to the contrast. 
Additionally, participants exposed to the contrast in two vowel contexts 
showed no evidence of overnight improvement on the stimuli in the trained 
vowel context, while those who heard the sounds in only one vowel context 
did improve after an overnight interval. These findings suggest that expo- 
sure to novel speech sounds in different vowel contexts may interfere with 
learning or consolidation of the contrast in a trained vowel context, even 
if that extra exposure is limited (i.e., at test only). It is also possible that 
the learning of the trained vowel context was less stable for participants 
exposed to two different vowel contexts, which may have prevented further 
improvement as a result of sleep. As can be seen, memory consolidation 
influences speech learning in the following ways: 


e Sleep helps consolidate newly formed representations of both natural 
and synthetic speech sounds as indicated by performance improvement 
(Earle & Myers, 2015a; Earle et al., 2017) and generalization (Earle & 
Myers, 2015b; Fenn et al., 2003, 2013). 


e Not all types of perceptual speech learning tasks show improvement 
after sleep (Roth et al., 2005; Eisner & McQueen, 2006), and this 
may depend on task difficulty, whether new representations are being 


218 Pamela Fuhrmeister 


formed, or whether existing representations are being expanded to 
accommodate new exemplars. 

e Exposure to certain stimuli (e.g., perceptually similar native language 
speech sounds) following training may interfere with learning of novel 
speech sounds (Earle & Myers, 2015a). 

e Training conditions (e.g., exposure to phonological variability, 
Fuhrmeister & Myers, 2017; production of speech sounds or words, 
Baese-Berk & Samuel, 2016) may destabilize or attenuate learning, 
which may affect the consolidation process. 


Although sleep appears to facilitate memory consolidation of speech in a 
variety of ways, many questions remain. For example, it is unclear what 
types of stimuli might interfere with non-native speech sound learning or 
under what conditions interference effects could be avoided. However, it 
may be possible to carry over insights from other domains in order to 
inform these questions and make predictions about speech learning. 


4. Failures of consolidation: Interference effects in learning 


In order to fully understand how new memories are formed, it is impor- 
tant to examine cases in which consolidation fails. Over a century ago, 
Müller and Pilzecker (1900) proposed that memories exist in an initially 
labile state, in which they are subject to interference from subsequently 
learned tasks. In their studies, they tested explicit recall of strings of unre- 
lated digits and observed that their participants were not able to recall 
one list as well when tested 24 hours later if they had learned a subse- 
quent list immediately following practice on the first list. Walker’s (2005) 
model for procedural memory consolidation similarly assumes that newly 
acquired memory traces are fragile and must undergo a process of stabili- 
zation before becoming resistant to interference. This model is comprised 
of three main stages: acquisition, consolidation-based stabilization, and 
consolidation-based enhancement. Walker (2005) argues that the stabili- 
zation stage depends on the passage of time only, while the enhancement 
stage relies on sleep. Two dissociable systems have been proposed in the 
memory literature, which presents a challenge for extending theories of 
memory consolidation to other domains. The declarative memory system 
underlies explicit learning of facts or episodes (sometimes referred to as the 
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memory for “what”), while the procedural memory system serves memory 
of implicitly acquired actions or procedures (the memory of “how”) (e.g., 
Squire, 2004). Although Walker’s model was originally intended for proce- 
dural memory and Müller and Pilzecker’s account for declarative memory 
(though their account predates this term), some findings suggest proce- 
dural and declarative memory systems may not be as dissociable as once 
thought (Poldrack et al., 2001). In addition, studies including both declar- 
ative and non-declarative tasks lend support to the notion that the con- 
solidation of a newly acquired skill or memory can be disrupted if an 
interfering task or stimulus is introduced before the memory has stabi- 
lized. Furthermore, it remains unclear whether speech category formation 
can be neatly classified as either procedural or declarative learning, and 
this process may be different at different points throughout the lifespan 
or under different learning conditions. Therefore, for the remainder of 
this review, I will draw on the literature of learning and memory pro- 
cesses for both declarative and procedural tasks and will reflect on the 
importance of a stabilization period following initial encoding in order 
to help new memory traces become resistant to interference. This section 
reviews a series of studies that have examined interference effects in several 
domains of learning, including the time course required for stabilization 
and consolidation of memory traces and the strength of initial learning or 
encoding. The goal of drawing on this literature is to make predictions 
about how speech sound learning may be facilitated by mitigating interfer- 
ence or adhering to a training paradigm or schedule that is more conducive 
to consolidation and long-term retention. 

A seminal study in the motor learning domain demonstrated inter- 
ference with a behavioural task, in which participants learned to move 
a two-hinged handle to a target while compensating for perturbation 
(Brashers-Krug et al., 1996). Participants who learned to compensate for 
perturbation experienced a disruption of consolidation if, immediately 
following training, they were trained to compensate for perturbation in 
the opposite direction. Another group of participants completed identical 
training on the first task, but their second task consisted of moving the 
handle to a target in the absence of any perturbation. These participants 
showed no interference effect from the second task. An additional group 
completed the two training sessions with perturbation in opposing 
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directions but waited four hours between the two training episodes. This 
group also showed performance improvement on the first task, indicating 
that a period of four hours was sufficient to stabilize learning of the first 
task, making it immune to disruption from a second task. Similarly, a study 
by Walker et al. (2003) demonstrated interference effects in a finger tap- 
ping sequence task unless six hours had passed between the two training 
sessions. In addition to behavioural tasks, the application of transcranial 
magnetic stimulation (TMS) as a source of interference has been tested 
(Muellbacher et al., 2002). In this motor learning study, participants prac- 
ticed a finger movement sequence and were assessed on their improvement 
in speed. When TMS was administered immediately following training, 
participants showed no retention of the behavioural gains observed during 
the learning phase. However, if a period of six hours had lapsed before 
TMS was applied, no interference was observed. These studies support 
the consolidation hypothesis and Walker’s consolidation model by dem- 
onstrating that the passage of time is necessary to stabilize newly encoded 
motor memories. Furthermore, the type of task that follows learning 
may dictate whether learning on the first task is disrupted. These findings 
may be able to make important predictions in the speech domain. For 
example, if learners train on non-native speech sounds and are exposed 
to speech sounds in their native language before the stabilization period 
has concluded, consolidation of the non-native sounds may be obstructed. 
Similarly, it may also be the case in the speech domain that not all stimuli 
or tasks interfere equally. If that is indeed the case, it will be important 
to identify which types of stimuli or tasks (e.g., native language expo- 
sure, Earle & Myers, 2015a; speech production, Baese-Berk & Samuel, 
2016) are able to interfere with speech sound learning. 

Similar task and timing effects to those found in motor learning 
have been observed in visual perceptual learning tasks. For example, in 
a visual hyperacuity task, participants saw two presentations of three 
dots arranged vertically on a screen and were asked to indicate whether 
the middle dot was offset in the either the first or second group of dots 
presented (Seitz et al., 2005). Following training, participants completed 
training on another task: in one task, the presentation of the dots was the 
same except the offset was presented in the opposite direction, and other 
tasks varied the spatial location and the orientation of the dots (i.e., the 
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dots were presented horizontally). Crucially, only the participants who 
were trained with the opposite offset direction experienced interference. 
Visual perceptual learning of stimuli that were presented at different spa- 
tial locations or in different orientations did not interfere with initial 
encoding of the task, which the authors attributed to the retinotopic 
specificity of spatial location and orientation. Another critical finding in 
this paper was that participants who waited one hour before training on 
the opposite direction did not demonstrate any attenuation of learning 
on the first direction. An additional visual perceptual learning study 
using a line orientation detection task found a period of 3.5 hours to be 
sufficient to eliminate retrograde interference from a second visual task 
(Shibata et al., 2017). These findings from visual learning studies provide 
further support that fragile memory traces remain susceptible to interfer- 
ence until a period of stabilization has passed. Like the motor learning 
study by Brashers-Krug et al. (1996), the study by Seitz et al. (2005) 
shows that not all tasks have the potential to interfere with consolidation 
of a previously learned task. Evidence from these two domains, namely 
vision and motor learning, suggests that domain-general processes may 
underlie consolidation of learning and may therefore be applicable to the 
speech domain. 

In further support of the consolidation hypothesis, one study investi- 
gating interference from consecutive tasks in patients with amnesia found 
surprisingly similar stabilization effects, despite the fact that declarative 
memory consolidation deficits are a hallmark of amnesia. Dewar et al. 
(2009) had individuals with amnesia learn word lists, and these participants 
showed a graded advantage in recall after the presentation of interfering 
stimuli at different time points. Participants experienced a delay between 
the initial learning session and the presentation of interfering stimuli, and 
longer delays facilitated recall of the original word lists more effectively 
than shorter delays. This suggests that even individuals with amnesia who 
have declarative memory consolidation deficits can benefit from a stabili- 
zation period following learning. 

If non-native speech sound learning processes parallel those of visual, 
motor, and word learning, the stabilization phase prior to consolidation 
may be crucial to learning and retention of non-native speech sounds. If 
the stabilization phase is disrupted from exposure to conflicting stimuli or 
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practice on an interfering task, this may hinder consolidation and reten- 
tion of novel speech sounds. 

Although many studies have found robust support for the consolida- 
tion hypothesis, results from other studies challenge its reliability. For 
example, Goedert and Willingham (2002) trained participants on two 
implicit motor tasks with the goal of testing whether these memories 
undergo consolidation and become resistant to interference from learning 
a similar task. The researchers first utilized a serial reaction time task 
in this study, in which participants saw a sequence of circles appear in 
boxes on a screen and pressed a button corresponding to each box after 
a circle was presented. In this paradigm, participants are unaware that 
the sequence is not random but consists of an underlying pattern; there- 
fore, learning is implicit. The second task used in this study was a task 
in which participants learned a new visuomotor mapping. Participants 
were instructed to point at a target on a screen while wearing prism 
glasses that displaced their vision. Training for each task followed a tra- 
ditional interference paradigm (train on task A, train on task B, test on 
task A), and participants were trained on different sequences and visual 
displacements for task B at varying intervals following training on task 
A. Unlike several previous studies, this study did not find evidence that 
the motor memories had been consolidated and become resistant to inter- 
ference, as even 24 hours was not sufficient to protect against interference 
from task B. A study employing similar visuomotor tasks by Caithness 
et al. (2004) additionally found that memories for one task were suscep- 
tible to interference from another task even 24 hours later. Walker et al. 
(2003) demonstrated similar effects in a finger tapping task. Interestingly, 
participants who trained on task A, waited 24 hours, and performed 
task A again before learning task B did not retain their learning of task 
A. Walker and colleagues (2003) posited that reactivation of consolidated 
memories can shift them into labile states, causing them to become sus- 
ceptible to interference once again. Caithness et al. (2004) speculated that 
performance on task B in their study may have been sufficient to reacti- 
vate memory traces of task A, which allowed task B to interfere with task 
A. Goedert and Willingham (2002) largely attributed this ostensible lack 
of consolidation to task differences or neural structures underlying the 
specific task used in their study. 
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An open question is what these memories require in order to be trans- 
ferred into a stable state and resist subsequent interference. It is possible 
that certain tasks or types of learning undergo a different consolida- 
tion process than others or are not consolidated at all. It appears that 
non-native speech sound learning can indeed undergo consolidation as 
evidenced by improvement in the absence of further practice (Earle & 
Myers, 2015a, 2015b; Earle et al., 2017), and it may be the case that a sta- 
bilization or consolidation period would protect newly formed phonetic 
category representations from interference. 


5. Stability and strength of learning 


The studies reviewed in the last section, which encompass both procedural 
and declarative tasks, provide important evidence for consolidation the- 
ories proposed by Miiller and Pilzecker (1900) and Walker (2005): most 
newly acquired memory traces need to undergo a period of stabilization 
in order to become resistant to interference. In some cases, the presence 
of interfering stimuli during the stabilization phase may be strong enough 
to disrupt the consolidation process entirely. An important question to 
address in speech learning studies will be how to minimize interference 
during the stabilization phase or to identify training conditions in which 
information may be consolidated in spite of interference. 

In addition to the passage of time during a stabilization phase, strength 
and stability of initial learning may be an important factor in determining 
whether information is consolidated. Ebbinghaus (1885) first proposed 
that increasing the repetition of practice trials in a task may lead to better 
retention of the information 24 hours later. In addition to the early findings 
by Ebbinghaus, several recent studies in the visual domain lend support 
to this idea. For example, a study by Hauptmann et al. (2005) found 
that participants who practiced a visual task until performance reached 
asymptote improved after a period of sleep, while those who did not prac- 
tice to this criterion failed to show overnight improvement. Tucker and 
Fishbein (2008) trained participants on a series of declarative memory 
tasks and had some take a nap following training. Interestingly, only the 
high-performing participants in training benefitted from sleep, suggesting 
that stronger learning can facilitate overnight improvement. In a study by 
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Shibata and colleagues (2017), participants who overlearned (continued to 
practice after the point of mastery) a visual perceptual task did not experi- 
ence interference from a second task, suggesting that hyper-stable learning 
can accelerate or even obviate the need for a stabilization phase following 
learning. Taken together, these studies suggest that benefits of consolida- 
tion may depend on how strongly information is initially learned. 

Conversely, a few studies have found that sleep preferentially enhances 
recall of weakly learned information or performance on more difficult 
tasks. For example, Drosopoulos and colleagues (2007) had participants 
memorize word pairings and manipulated how strongly the pairings 
were learned. Of the participants who only weakly learned the pairings, 
participants who slept forgot significantly fewer word pairings when tested 
two days later as compared to a wake group. However, sleep and wake 
groups were comparable if the information was strongly learned during 
training. Although sleep did seem to benefit the group that did not learn 
the information as rigorously to begin with, these findings also support 
the benefits of strong initial encoding. Even though the wake group did 
not sleep immediately after learning, they performed equivalently to the 
sleep group. In addition, it seems difficult to rule out ceiling effects in this 
study, as the participants in the strong encoding group performed at over 
95 % accuracy. Similarly, a study using a procedural motor learning task 
found superior benefits of sleep for the most difficult task during training, 
as measured by an increase in speed (Kuriyama et al., 2004). Analogous 
to the Drosopoulos et al. (2007) paper, the participants who learned the 
easier tasks still outperformed the group that learned the more difficult 
task, although the benefits from sleep were not as drastic. 

As shown above, the mixed evidence presented here implies a compli- 
cated relationship between strength of learning and consolidation. Some 
studies suggest that strongly encoded information is advantageous for con- 
solidation, while others show stronger sleep-related benefits for weakly 
learned information. Critically, in the studies showing benefits of sleep 
consolidation for weakly learned information, participants who trained on 
easier tasks or trained on the same tasks to a higher criterion demonstrated 
superior overall performance, which should be considered along with the 
superior benefits of sleep for weakly learned information. Additionally, 
ceiling effects arguably cannot be completely ruled out in these studies. It 
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is also possible that the benefits of sleep are observed in a u-shaped trajec- 
tory because the qualitative changes associated with sleep-mediated consol- 
idation do not always manifest behaviourally. For example, newly learned 
information may need to reach a minimum level of stability in order to 
trigger consolidation, and sleep may show the strongest influence on these 
memories as far as behavioural changes can be observed. However, sleep has 
been shown to induce qualitative changes to memories, such as the ability to 
generalize to new contexts (e.g., Fenn et al., 2013), increased automaticity 
of a task as measured by electrophysiological components (Atienza et al., 
2004), and differential functional activation patterns in response to stimuli 
after sleep (Davis et al., 2009). For example, the study by Atienza et al. 
(2004) trained participants to discriminate auditory tone patterns. Some 
participants slept after training, while another group was sleep deprived. 
They found improved behavioural performance for both groups, regard- 
less of sleep; however, an electrophysiological component that responds to 
the involuntary switching of attention was elicited only in the participants 
who slept after training. Using fMRI, Davis et al. (2009) found changes in 
functional brain activation following sleep in participants who learned new 
words. Specifically, they found activation in the hippocampus before sleep 
consolidation, but after sleep, activation was observed in cortical areas. This 
indicates that the memory traces of the new words underwent qualitative 
changes in how they were represented in the brain. Therefore, sleep may 
indeed benefit learning or qualitatively reorganize information, even if these 
changes are not always evident in behavioural performance. All things con- 
sidered, the benefits of strong initial learning seem clear: stronger encoding 
or overlearning typically results in better overall behavioural performance, 
and it can protect against interference from subsequent learning and poten- 
tially bestow benefits equivalent to those of sleep for long-term retention. 
This may be an important consideration for learning situations in which 
sleep following training is not possible. 


6. Elucidating findings from non-native speech sound 
learning studies in the context of interference and stability 


Concepts such as stability of learning and interference in memory con- 
solidation may offer a more comprehensive account of some findings 
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from non-native speech sound learning. That fact that parallels emerge 
from several domains of learning (e.g., visual, motor, and word learning) 
may indicate that domain-general encoding, stabilization, and consolida- 
tion processes underlie many different types of learning, including speech 
sound learning. In fact, several studies reviewed above can be viewed 
through the lens of interference theories. For example, the finding by 
Earle and Myers (2015a), that native language exposure interfered with 
consolidation of a non-native, Hindi contrast, could be explained both 
by non-native speech sound learning theories and interference theories; 
however, a more comprehensive explanation could be arrived at by con- 
sidering these theories together. Although theories of non-native speech 
sound learning (such as those reviewed above) differ on certain details 
and areas of focus, most attribute difficulties in non-native speech sound 
learning to perceptual similarity of native language speech sounds. In 
line with these theories, the stability and robustness of native-language 
phonetic categories may greatly enhance the difficulty of learning per- 
ceptually similar non-native categories. Additionally, both Müller and 
Pilzecker’s (1900) consolidation hypothesis and Walker’s (2005) proce- 
dural memory consolidation model postulate a necessary stabilization 
phase after learning takes place. If native language exposure immediately 
follows training on non-native speech sounds before the new speech cat- 
egory representations have had time to stabilize, native language input 
would interfere with these memory traces and impede or prevent con- 
solidation from taking place. Results from Earle & Myers (2015a), in 
which native language exposure disrupted consolidation of a non-native 
phonetic contrast, diverge from the synthetic speech study by Fenn and 
colleagues (2003), in which sleep was able to recover information that 
was degraded (or possibly interfered with) throughout the course of a 
day. However, Ebbinghaus (1885) and other studies reviewed above sup- 
port the view that strength and stability of learning is crucial to consol- 
idation, and this notion may account for the discrepancy observed in 
these studies. Learning novel acoustic mappings to existing speech cat- 
egories (that is, learning how the unusual synthetic speech signal maps 
to well-developed English phonology), as was done in the Fenn et al. 
(2003) paper, is arguably less difficult than establishing entirely new 
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perceptual categories. It is reasonable to speculate that both the time 
course of learning and consolidation and the strength of initial learning 
work in tandem to selectively consolidate memories. Learning may have 
been more stable for the synthetic speech task in Fenn et al. (2003) than 
the non-native speech sounds learned in Earle and Myers (2015a), which 
would explain why sleep-mediated consolidation was able to recover 
learning of synthetic speech that had decayed throughout the day but 
not the developing representations of non-native speech sounds. 

Strength and stability of learning may further elucidate studies finding 
disruptions of learning or consolidation as a result of speech production 
(Baese-Berk & Samuel, 2016) or phonological variability (Fuhrmeister 
& Myers, 2017). Neither of these studies was designed according to the 
typical interference paradigm (learn task A, learn task B, test on task A); 
however, it appears that speech production and exposure to phonological 
variability resulted in representations that were less stable and less able 
to benefit from consolidation. Ultimately, the precise cause for attenuated 
perceptual learning following speech production remains unclear. Motor 
theories of speech perception posit that articulatory gestures underlie per- 
ceptual representations of speech (see Galantucci et al., 2006 for review). 
According to this view, it is possible that activating motor representations 
interferes with developing representations of speech categories. An addi- 
tional possibility is that engaging native language phonological categories 
in any modality diminishes the strength and stability with which the novel 
categories are learned. Exposure to phonological variability may similarly 
reduce stability of learning: according to attention to dimension models 
of non-native speech sound learning, learners must direct their attention 
to relevant acoustic cues, which are, in many cases, different from the rele- 
vant cues for the first language (Francis & Nusbaum, 2002). Presentation 
of novel speech sounds in different phonological contexts may not allow 
the learner to quickly discover the acoustic cues that are necessary to 
distinguish different speech categories, as formant transitions some- 
times change based on the vowel that follows a consonant. If the learner 
receives conflicting information for different phonological contexts, this 
would likely result in learning that is less stable, which may not benefit 
from consolidation, at least in the short term. 
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7. Promoting consolidation of non-native speech sounds 


With models of interference and stability in memory consolidation in 
mind, it may be possible to improve training programs in order to sup- 
port consolidation of novel phonetic categories. First, it is necessary to 
determine what types of stimuli can interfere with or destabilize this pro- 
cess. Specific tasks or stimuli that have the potential to interfere with 
non-native speech sound learning have not been extensively investigated. 
Nevertheless, by examining the extant literature, it appears that native 
language phonology is one source of interference (Earle & Myers, 2015a). 
Future perceptual training paradigms may induce more robust learning if 
they attempt to minimize exposure to native language phonology until a 
sufficient stabilization period has passed, especially if training takes place 
earlier in the day. If it is not possible to minimize native language expo- 
sure, learners may benefit from longer or more intensive training, in order 
to strengthen or stabilize learning. As seen in the study by Shibata and 
colleagues (2017), hyper-stable learning was resistant to subsequent inter- 
ference, and non-native speech sound learning may show a similar pat- 
tern. Although some similarities between speech sound learning and visual 
or motor learning exist, this may be an area where speech diverges from 
other domains. Specifically, there are few everyday activities that come in 
conflict with the visual and motor tasks utilized in the studies reviewed 
above—for instance low-level line orientation detection tasks or compen- 
sation for perturbation in motor learning. On the other hand, is difficult to 
avoid speech in the real world, which presents challenges for experimental 
design. For example, participants who learn non-native speech categories 
in the laboratory most likely have immediate access to interfering or con- 
flicting stimuli, such as their own speech production or acoustic speech 
input from listening to other talkers. Unless this is experimentally con- 
trolled (i.e., participants stay in the laboratory for an extended period of 
quiet time following training), it is difficult to account for the events that 
happen after a learning session. Due to practical limitations, this has yet 
to be explored. 

As discussed above, work by Baese-Berk and Samuel (2016) suggests 
articulation or production of speech sounds also seems to interfere with 
developing perceptual representations of speech. Interestingly however, 
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work by Bradlow and colleagues (1997) suggests that production accuracy 
can be enhanced by perceptual training alone, and this effect is stable over 
time (Bradlow et al., 1999). A similar study by Neufeld (1979) corroborates 
these results: participants who received perceptual training only were later 
able to produce words in a second language without a detectable non- 
native accent. It may be the case that perceptual training alone can induce 
concomitant improvements in speech production. Because relatively few 
studies have examined the relationship between speech perception and pro- 
duction and its influence non-native speech sound learning, future research 
will ultimately be needed to determine at what point in the learning trajec- 
tory and in what capacity production of new speech sounds is beneficial to 
the learner. With the available evidence, however, it seems that minimizing 
speech production in training, at least in the early stages of learning, may 
result in optimal outcomes for both perception and production of second 
language speech sounds. 

Additionally, differing acoustic cues (e.g., the different formant trajec- 
tories associated with a dental stop in the context of an /i/ compared to 
an /u/ vowel) may be detrimental to stable learning and consolidation of 
non-native speech sounds. Thus, it may be advisable to limit phonological 
variability in training or during the stabilization phase following training. 
While several studies have found advantages of high-variability training 
procedures, these studies have typically taken place over the course of sev- 
eral weeks (e.g., Logan et al., 1991; Lively et al., 1993, 1994; Bradlow 
et al., 1997, 1999). In that case, the slow, procedural learning system may 
have had ample time to discover the regularities in the input (McClelland 
et al., 1995). This would also explain the enhanced generalization abilities 
as a result of this type of training; participants may have developed more 
robust abstract representations of the phonological categories, allowing for 
generalization to novel talkers or phonological contexts. While this may 
be the case, the efficacy of this training paradigm may be limited to certain 
situations. In particular, intensive training over long periods of time may 
not always be possible. For example, some second language classes meet 
only once per week, and this frequency may not be sufficient for learners 
to take advantage of high-variability training. As seen in Fuhrmeister and 
Myers (2017), even minimal exposure to phonological variability during 
a testing phase only (i.e., training was identical) attenuated learning of a 
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non-native contrast in the trained vowel context. Additionally, no over- 
night improvement was observed for those participants. This suggests that 
exposure to variability may not be optimal in certain cases, especially when 
training sessions are sparse. Especially in such situations, it is important 
that learners can consolidate newly acquired information to begin devel- 
oping representations for non-native phonological categories. A more effi- 
cient training method may involve evening training sessions that occur 
close to sleep (to minimize native language exposure afterwards) that 
include only limited variability in the stimulus presentation. In fact, Earle 
and Myers (2015b) found generalization of non-native speech sounds to 
a novel talker following an interval of sleep, even though their training 
tokens consisted of sounds spoken by a single talker and presented in a 
single vowel context. Thus, sleep consolidation processes facilitated gen- 
eralization to the sounds spoken by a new talker. Based on the evidence 
presented, striking a delicate balance between stimulus variability and 
proximity of training to sleep may promote consolidation of the trained 
information to long-term memory, which is essential to building novel 
phonological categories. Even so, second language learners experience dif- 
ferent cues in the real world, and they need to be able to integrate informa- 
tion across them. Ultimately, future research will need to determine which 
factors promote abstraction over different acoustic cues. 

Although we have some evidence as to what types of stimuli have 
the potential to interfere with non-native speech sound learning, many 
questions remain open. For example, visual and motor learning studies 
have often found that not all tasks interfere equally. In the study discussed 
above by Seitz and colleagues (2005), visual stimuli that differed in spa- 
tial location and orientation did not interfere with a perceptual learning 
task. Because primary visual cortex is retinotopically organized, Seitz et al. 
(2005) reasoned that different neurons were responding to visual stimuli 
in different spatial locations and orientations, whereas the same neurons 
responded to the task in which only the direction of offset differed and 
location and orientation remained the same. These same neurons had to be 
overwritten in order to learn the second task. It is unknown whether any 
potential speech analogs for such a task exist. Because primary auditory 
cortex is tonotopically organized, however, it is possible that training on 
similar stimuli at a different frequency (e.g., varying talker gender) would 
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not interfere with training on the first frequency range. However, speech 
may differ from other types of learning due to the complexity of the signal. 
In addition, findings from motor learning suggest that training on an oppo- 
site force, as in Brashers-Krug et al. (1996), interferes with learning of the 
initial direction, and word learning studies using an A-B A-C paradigm 
indicate that a new pairing (C) with an original stimulus (A) interferes 
with recall of the original pairing (A-B). Speech correlates of such findings 
are less obvious, and future research will need to address these questions. 

Next, the time course of consolidation and susceptibility to interfer- 
ence in non-native speech sound learning should be considered. If the con- 
solidation hypothesis or Walker’s (2005) model can be applied to speech 
learning, it is essential to determine under what conditions a stabilization 
phase is necessary, and when so, how long this stabilization period needs 
to be in order to protect memory traces from subsequent interference. 
In the visual, motor, and word learning studies reviewed above, several 
time frames ranging from a few minutes to several hours have proven 
successful in protecting information against interference; however, some 
studies found interference even after 24 hours had passed between training 
sessions on each task. This suggests that domain or task differences may be 
responsible for some of the varied results obtained in these studies. Based 
on the results of Earle and Myers (2015a), it seems clear that non-native 
speech learning would benefit from a stabilization period; however, this 
has yet to be explored in the speech domain. Ultimately, future research 
will need to elucidate the time course of stabilization and consolidation in 
the speech domain. 


8. Interference and second language speech 
learning in naturalistic contexts 


While this review has primarily focused on memory consolidation 
and interference in the context of laboratory learning of non-native 
speech sounds, these concepts may be relevant to naturalistic learning 
environments, as well. For example, an individual’s amount of first 
language use (among other factors) has been found to predict speech 
production accuracy in the second language (Flege et al., 1997; Piske 
et al., 2001). One possible explanation for these results is that habitual 
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interference from the first language obstructs developing second language 
speech category representations, especially if the stabilization phase is con- 
sistently disrupted. Further support for this idea comes from studies mea- 
suring language proficiency following immersion programs. Immersion 
programs have largely been successful for second language acquisition, 
including acquisition of speech sounds (e.g., Anderson, 2004; Cheour 
et al., 2002; Freed et al., 2004). Immersion settings present few opportu- 
nities for interference from the native language, which may allow memory 
traces of second language speech sounds to stabilize and more efficiently be 
consolidated into long-term memory. In addition, the procedural memory 
system can likely develop more robust category representations with the 
time and amount of exposure afforded by the immersion setting to dis- 
cover regularities in the second language speech system. In fact, the study 
by Freed et al. (2004) compared native English-speaking college students 
learning French in a classroom setting in the home country, an intensive 
summer immersion program in the home country, and a semester-long 
study abroad program in France. While students in the study abroad and 
summer immersion programs outperformed the classroom learners after 
the period of study, students in the study abroad program reported much 
more first language use (English) outside the classroom than the summer 
immersion group. Consistent with first language use accounts, students in 
the study abroad group made fewer gains in proficiency than the students 
in the immersion program. Walker et al.’s (2003) findings on reconsolida- 
tion may similarly explain why less frequent first language use contributes 
to better speech perception and production in the second language. In 
this study, they found that practice on a task that had already been con- 
solidated through sleep could reactivate the memory trace of that task, 
causing it to return to a labile state. If participants learned a second task 
after reactivating memory traces of the first task, interference from the 
second task on the first was observed. If upon reactivation, memories 
return to a labile state subject to interference, using the second language 
may reactivate memories and transfer them to this labile state. If frequent 
and intermittent use of the first language interferes with the reactivated 
second language memory traces, it may have the power to interfere 
with reconsolidation of the second language memory traces, especially 
if first language memory traces are much more robust. Thus, studies of 
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interference in memory consolidation may have explanatory power even 
in naturalistic language learning settings. 


9. Speech sound learning across the lifespan 


A common empirical finding in studies of second language acquisition is 
that second language speech sound learning typically decreases with age. 
Many have attributed this to a putative critical period (e.g., Granena & 
Long, 2012); however, others have found a more linear decline in second 
language speech production abilities throughout the lifespan that would 
be less indicative of a critical period (e.g., Flege et al., 1995). Although 
a complete discussion of this issue is outside the scope of this chapter, 
some speculations can be made when considering both models of memory 
consolidation and non-native speech sound learning. Assuming no strictly 
defined critical period exists but non-native speech sound learning becomes 
increasingly more difficult throughout the course of the lifespan, ideas put 
forward in the learning and memory literature may be applicable. For 
example, the same system may underlie acquisition of speech sounds in 
infancy, childhood, and adulthood; however, adults cannot approach the 
speech sound learning task in the same way as an infant or child because 
their prior experience and interactions with the environment are vastly 
different (see Best & Tyler, 2007 for discussion). Infants do not have 
well-established first language speech categories, while adults have devel- 
oped quite robust speech categories in the native language after years or 
decades of exposure. In fact, Burnham et al. (1991) found that speakers of 
a language become more categorical in their perception of native language 
speech sounds over the life span, and Baker et al. (2008) observed that chil- 
dren are less likely than adults to assimilate second language speech sounds 
to first language categories. This may imply that increasingly more stable 
first language categories are less malleable than less-stable categories, such 
as those in childhood. This could have several implications for non-native 
speech sound learning. As predicted by non-native speech perception and 
learning models, the native language exerts a powerful influence on the 
perception and ability to learn non-native speech categories. Additionally, 
theories of learning and memory (e.g., Ebbinghaus, 1885) and findings in 
the visual domain by Shibata et al. (2017) suggest that stable or strongly 
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learned information can cause proactive interference. In this way, theories 
from both of these domains could be taken together to imply that native 
language speech categories interfere with non-native speech sound learning 
in a graded manner: stronger native language categories may increase the 
difficulty in developing new categories. In addition, children may be less 
susceptible to interference in certain types of learning prior to puberty. In 
a study by Dorfberger and colleagues (2007), children before and after the 
onset of adolescence learned a procedural motor task. Younger children 
showed no advantage in learning or retaining the sequence; however, when 
they were trained on an additional, opposing sequence, 9- and 12-year- 
olds demonstrated no evidence of interference of the second sequence on 
the first. Seventeen-year-olds, on the other hand, did show an interference 
effect. This finding lends support to critical period hypotheses but in a 
different way than they are traditionally depicted. Proponents of a critical 
period typically focus on child advantages in learning, but this finding 
suggests adults may be as good as children at learning certain tasks; how- 
ever, their consolidation processes may differ. Some studies investigating 
second language speech learning in children and adults have found supe- 
rior perceptual learning in adults initially, but after some time children not 
only caught up to the adults in performance but actually surpassed them 
(Snow & Hoefnagel-Hohle, 1978). This finding may be consistent with the 
Dorfberger et al. (2007) study: interference from the first language may 
not have influenced children’s learning of novel speech sounds, while it 
may have obstructed adults’ learning over time. 

Simultaneous and sequential bilinguals may additionally offer some 
insight into memory consolidation of speech sounds over the lifespan. For 
example, children who begin learning a second language in early childhood 
seldom have a detectable non-native accent in either language; however, 
late-onset bilinguals often do (e.g., Flege et al., 1995). Thus, it appears 
that infants and young children can learn the speech sound inventory of 
two languages simultaneously without interference from either language. 
Assuming the slow, procedural memory system subserves this process 
(McClelland et al., 1995; Dudai, 2004), this system would discover reg- 
ularities in both sound systems simultaneously, even if speech categories 
in both languages are close in perceptual space. For example, an infant 
learning Spanish and English would learn that the distribution of the 
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bilabial voiced stop consonant clusters around a lower voice onset time 
than the same category in English. The lack of experience with a specific 
language’s speech sound inventory may additionally facilitate the acquisi- 
tion of speech categories in two languages simultaneously. 

Furthermore, evidence from bilinguals who learned languages sequen- 
tially may shed light on this process. For example, a study by Antoniou 
et al. (2012) found that early sequential bilinguals who were dominant in 
their second language were influenced by their dominant (second) language 
in their perception of consonants in both languages. This parallels previ- 
ously discussed findings showing effects of first language use on second 
language speech perception and production: less frequent first language 
use (or more frequent second language use) may be a strong factor in the 
development of second language speech categories. In other words, while 
important, age of initial acquisition of a language is not necessarily the 
determining factor in how well the speech system of that language will 
be learned. Ultimately, future research will be needed to elucidate relative 
influences of age of acquisition and amount of language use on the devel- 
opment of second language speech category representations. 


10. Conclusion 


Although a comprehensive account of the non-native speech sound 
learning process has yet to be established, many insights can be gained 
from considering findings from domain-general learning and memory 
studies. In particular, factors such as interference during a critical sta- 
bilization phase following learning and the strength of initial encoding 
may be important considerations when designing training paradigms for 
learning novel speech categories or when interpreting findings from this 
field. Whether models of memory consolidation can be applied to speech 
sound learning without modification remains unclear; however, they make 
concrete predictions for future studies and appear to have a great deal of 
explanatory power within the current literature. 
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Lisa Morano, Louis ten Bosch, Mirjam Ernestus 


Looking for exemplar effects: testing the 
comprehension and memory representations 
of r’duced words in Dutch learners of French 


Abstract: In this study, we tested whether second language (hereafter L2) learners 
can encode in the form of exemplars phonetic variation that does not occur regularly 
in their native language (hereafter L1). Three groups of Dutch learners of French 
performed a long-term repetition priming lexical decision task in which words were 
repeated. The second occurrence (target) of an experimental word either matched or 
mismatched the pronunciation of its first occurrence (prime). When a target matched 
its prime, both tokens had a completely devoiced or a completely voiced high vowel 
in their first syllable. When a target mismatched its prime, the prime had a devoiced 
high vowel in its first syllable, while the target had a voiced high vowel in its first 
syllable, and vice versa. In condition AA and in condition BB we reused the same 
token (albeit different tokens per condition) in case of a repetition match. In con- 
dition AB, we used different tokens for prime and target. The results show that L2 
learners are able to encode phonetic information that does not occur regularly in 
their L1 in the form of exemplars, showing that exemplars are formed before the L2 
phonological filter applies, but only under very limited conditions: when the prime 
is difficult to process and when the matching and mismatching tokens are easily 
distinguishable. Contrary to our expectations, we also found that mismatching 
devoiced primes significantly accelerated the recognition of the voiced B targets. 
We hypothesize that this latter result comes from a higher activation of abstract 
representations after difficult primes. Our results therefore show different processing 
patterns for identical testing conditions using different tokens (conditions AA and 
BB). These results question the use of exemplars in everyday speech comprehension, 
adding to the growing body of evidence that exemplar effects only arise in very 
restricted unnatural conditions. 


Keywords: memory, exemplar models, second language learning, adults, reduced 
words, French, comprehension 
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1. Introduction 


Many researchers now assume that the mental lexicon is hybrid in 
nature (Pierrehumbert, 2002; McLennan, Luce, & Charles-Luce, 2003; 
Goldinger, 2007), containing, for each word, both an abstract representa- 
tion of the word’s pronunciation (i.e. a string of abstract symbols such as 
phonemes), and a cloud of exemplars (i.e. occurrences encountered by the 
listener, each encoding fine acoustic characteristics such as speech rate, the 
speaker’s voice, but also phonetic details). Indeed, purely abstractionist 
or purely exemplarist models of speech comprehension both fail to ac- 
count for all the findings in the literature. For example, listeners’ ability to 
adapt to a speaker’s specific way of talking such as a lisp (i.e. perceptual 
learning; e.g., Norris, McQueen, & Cutler, 2003), or listeners’ ability to 
generalize a phonological rule to new words (e.g., Cristia, Mielke, Daland, 
& Peperkamp, 2013), cannot be explained if their mental lexicons only 
contain exemplars of previously encountered tokens without any degree 
of abstraction. Evidence for exemplars, on the other hand, comes from 
priming experiments (e.g., Tulving & Schacter, 1990), in which it has 
repeatedly been shown that native listeners recognize words faster or 
more accurately when they occur for the second time in the experiment (as 
“targets”) than when they occur for the first time (as “primes”) especially 
if the two tokens share fine, phonologically irrelevant, acoustic character- 
istics such as information about the speaker’s voice (i.e. both the prime and 
the target are uttered by the same person; e.g., McLennan & Luce, 2005). 
These specificity (or exemplar) effects suggest that the participants stored 
the first occurrences of the words with at least some degree of acoustic 
detail, that is, in the form of exemplars. 

Nearly all experiments investigating exemplars have been conducted 
with native (L1) listeners. Exemplar research has barely studied second 
language (L2) learners. Nevertheless, there is much to gain from research 
with L2 listeners. First, if exemplars play a substantial role in speech com- 
prehension, as most researchers currently assume, the findings obtained 
with L1 listeners should generalize to L2 listeners, as it is unlikely that 
listeners use two different mechanisms for speech comprehension in a L2 
and in their L1. Second, research with non-native listeners may provide 
information about which acoustic details are exactly stored in exemplars. 
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Are exemplars faithful representations of the acoustic signal or are they 
affected by the listener’s linguistic knowledge? That is, for L2 listeners, are 
exemplars formed before or after their L1 phonological filter (Troubetzkoy, 
1939) applies? 

It has been shown that L2 learners’ abstract representations diverge 
from those of natives. Pallier, Colomé, and Sebastian-Gallés (2001) found 
that even highly proficient Spanish-Catalan listeners treat all minimal pairs 
specific to Catalan as homophones in a lexical decision task with medium- 
term auditory implicit repetition priming: for Spanish-Catalan bilinguals, 
[neta] ‘granddaughter’, primed equally well [neta] and [neta] ‘clean’, and 
vice versa. Proficiency appears to play an important role, as was shown in 
another study. Darcy, Dekydtspotter, Sprouse, Glover, Kaden, McGuire, 
and Scott (2012) tested intermediate and advanced American English 
learners of French on two front vs. back rounded vowel contrasts in 
French (/y/-/u/ and /e/-/o/), which do not occur in English. In a lexical 
decision task with implicit repetition priming, both the intermediate and 
advanced learners patterned like the natives on the /a/-/o/ contrast, albeit 
with slower reaction times, while the intermediate learners, but not the 
advanced learners, treated the /y/-/u/ minimal pairs as homophones. This 
suggests that the intermediate learners did not distinguish /y/ and /u/ in 
their lexical representations, while the advanced learners did. These studies 
suggest that L2 phonological variation that is irrelevant in listeners’ L1 is 
not immediately stored in listeners’ L2 abstract representations, and that 
it may, or may not, eventually be stored abstractly at higher proficiency 
levels. 

Exemplars in L2 listeners need not be different from exemplars in L1 
listeners since L2 listeners have been shown to remain sensitive to L1 irrel- 
evant contrasts provided the task employed could be performed without 
requiring lexical processing such as a phoneme categorization task 
(Sebastian-Galles & Baus, 2005; Diaz, Mitterer, Broersma, & Sebastian- 
Galles, 2012). That is, L2 listeners are able to perform simple low-level 
tasks in phonetic mode but as soon as linguistic processing is required, 
such as for a lexical decision task, then their L1 phonological filter prevents 
them from processing the stimuli in a native-like fashion (with the notable 
exception of Darcy et al.’s, 2012, results). If exemplars are formed before 
the phonological filter applies, L2 exemplars can thus well encode L1 
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irrelevant variation. If exemplars are formed after the phonological filter 
applies, L2 exemplars probably encode less L1 irrelevant variation. Our 
research question was the following: Are L2 intermediate learners able to 
encode, in the form of exemplars, fine linguistic details about the proper- 
ties of the prime that are not relevant in their L1, and to subsequently use 
them for speech comprehension (i.e. to comprehend the target)? 

As previously mentioned, very little exemplar research has been carried 
out in L2. We could only find two studies reporting exemplar effects for L2 
listeners. Trofimovich (2005) tested American English learners of Spanish 
in an immediate repetition task. The participants first listened to a list 
of 36 prime words uttered by three male and three female speakers (the 
study phase). The participants then performed a 3—4 minute distractor 
task, followed by an immediate repetition task (the test phase) in which all 
the primes were repeated (as targets) either in the same voice as during the 
study phase, or in a different voice from the opposite gender, along with 
new words. These tasks were performed twice: once in English and once 
in Spanish, the task order being counterbalanced over all the participants. 
In their L2, the participants were faster at repeating the words previously 
heard in the same voice than words which had not been presented during 
the study phase, but they were equally fast at repeating words heard for 
the first time in the experiment as words previously heard in the experi- 
ment in a different voice. The participants thus treated L2 words repeated 
in a different voice just as new items in the test phase. 

In their L1, Trofimovich’s participants showed priming but no exem- 
plar effects: the participants were faster at repeating English words already 
heard in the study phase than words which had not been presented in the 
study phase, but it did not matter whether those words were uttered in the 
study phase in the same or in a different voice. Although Trofimovich’s 
study did not replicate previous studies which found exemplar effects for 
native listeners (e.g. Craik & Kirsner, 1974; Palmeri et al., 1993; Luce & 
Lyons, 1998), it shows that exemplar effects can be found for L2 learners. 

Further evidence that L2 listeners can store exemplars was provided by 
Winters, Lichtman, and Weber (2013). The authors tested three groups of 
listeners in German: English monolinguals, English learners of German, 
and German monolinguals in an old/new auditory categorization task. The 
stimuli were monosyllabic consonant-vowel-consonant (CVC) German 
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words, which varied in frequency of occurrence (low, medium, high), and 
were uttered by five female voices in one block and five male voices in 
another block (the order being counterbalanced over participants). Within 
each block, half of the words were repeated either with the same or a dif- 
ferent voice. The authors found that target words presented in the same 
voice as their primes were classified correctly more often than target words 
presented in a different voice, irrespective of the listener group. 

L2 listeners are thus able to store details about the speaker’s voice in 
the form of exemplars. This may not come as a surprise since L2 listeners 
already have ample experience in processing indexical variation in their 
L1, and it has been shown that the ability to use consistent information 
about a speaker’s voice across items is easily transferable to L2 speech per- 
ception (Bradlow & Pisoni, 1999). The question is whether L2 listeners 
not only store in exemplars indexical information but also phonetic vari- 
ation that occurs regularly in their L2 but not in their L1. While exemplar 
effects encoding indexical variation have already been attested by Winters, 
Lichtman, and Weber (2013) and Trofimovich (2005), to our knowledge, 
no previous study has found exemplar effects encoding L2 phonetic var- 
iation that does not occur regularly in the listener’s L1. In this study, we 
tested whether exemplar effects in L2 listeners can also be found when 
manipulating regular phonetic variation instead of indexical (or speaker) 
variation. 

One way to study exemplar effects for regularly occurring L1 specific 
phonetic variation instead of indexical variation is to focus on pronuncia- 
tion variants of words resulting from reduction. Reduction is the weakening 
or deletion of phonemes or even whole syllables, occurring in informal 
connected speech, compared to the words’ canonical pronunciations, that 
is the pronunciations of words in isolation (Ernestus & Warner, 2011). 
Most previous experiments investigating exemplar effects by manipulating 
linguistic variation focused on categorical variation, substituting one allo- 
phone with another allophone (e.g. [e] with [e] in Pallier et al., 2001; 
and [t] and [d] with [r] in McLennan, Luce & Charles-Luce, 2003). It 
could be argued that in these experiments listeners stored several abstract 
representations (one for each word pronunciation variant) rather than dif- 
ferent exemplars. Using categorical variation therefore makes it difficult to 
attest for the role of exemplars. 
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Reduction reflecting continuous variation, on the other hand, cannot 
be stored abstractly. Such reduction may result in an infinite number of 
realizations, which all activate the same abstract pronunciation variant of 
the word. Reduction reflecting continuous variation is thus an interesting 
characteristic to manipulate in order to test for unambiguous exemplar 
effects. To our knowledge, no previous study has done so. 

In our study, we investigated the reduction phenomenon of phrase- 
medial high vowel devoicing. In casual French, in a noun phrase like la cité 
([la.si.te] ‘the city’), the /i/ can be more or less devoiced (up to completely) 
as the voicing (i.e. vibration of the vocal folds) fails to be re-established 
in time after the devoiced consonant /s/ (Torreira & Ernestus 2010). 
Furthermore, phrase-medial high vowel devoicing in French is a gra- 
dient phenomenon. In their corpus study, Torreira and Ernestus found 
that the high vowels were more devoiced or completely absent after cer- 
tain consonants, the higher the speech rate, and the further away the 
vowel was from the end of the accentual phrase. Given that the same 
variables predict presence and amount of voicing, absence of voicing is 
the end of a continuum that is reached in extreme devoicing conditions. 
This phenomenon has never been reported for Dutch, suggesting that it 
is part of the sound pattern of French but not of Dutch. Consequently, 
if Dutch learners of French show exemplar effects in an experiment that 
manipulates phrase-medial high French vowel devoicing, we can con- 
clude that L2 learners can also store, in the form of exemplars, L2 specific 
sound patterns. 

We wished to use a task that requires deep processing of the stimuli 
to approach everyday speech processing. In our study, we used a lexical 
decision task. Although it can be argued that a lexical decision task is a 
very artificial task to investigate speech comprehension, it ensures a deeper 
linguistic processing than an old/new categorization task (or continuous 
recognition memory task) or a shadowing task, which are often used in 
exemplar studies (e.g. Craik & Kirsner, 1974; Palmeri, Goldinger, & Pisoni, 
1993; Goldinger, 1996; Bradlow, Nygaard, & Pisoni, 1999; Trofimovich, 
2005; Mattys & Liss, 2008; Winters et al., 2013). The words’ forms need 
to be accessed to elicit responses from the participants: to decide whether 
a stimulus is a real word or not the participants need to access what the 
word means, even vaguely. 
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We tested Dutch intermediate learners of French in a lexical decision 
task in French in which the experimental words contained a high vowel 
following a voiceless consonant. The experimental words were all 
repeated either as a pronunciation match (i.e. both the high vowel of 
the prime and that of the target were devoiced, or both were voiced) or 
as a pronunciation mismatch (i.e. when the high vowel of the prime was 
devoiced, the vowel of the target was voiced and vice versa). If participants 
react faster to a target when it matches than when it mismatches the 
pronunciation of its prime, we can conclude that L2 participants show 
exemplar effects, indicating that they are able to store, in the form of 
exemplars, phonetic information that does not occur regularly in their 
L1, and to later on reuse those exemplars to comprehend the next token of 
the word. 

We ran the same experiment three times. In condition AB, we used dif- 
ferent recordings for prime (a voiced or devoiced token A) and target (a 
voiced or devoiced token B). As already pointed out by Hanique et al. 
(2014), using two different tokens (or recordings) for prime and target 
represents a more ecologically valid testing condition than using iden- 
tical tokens, given that in daily life, we never hear the exact same token 
twice: in a conversation, if a person repeats a word, she will produce a new 
token that will vary slightly from the first one. 

We compared this condition with two conditions in which the prime 
and target were identical in case of a match (like in nearly all the previous 
studies on exemplar effects): one using only the tokens used in the first 
condition as primes (condition AA), and one using only the tokens used in 
the first condition as targets (condition BB). 


2. Method 
2.1. Participants 


We tested 120 Dutch university students who had studied French for 
four to seven years in high school (intermediate level, or B1-B2 levels of 
the Common European Framework of Reference for Languages, CEFR, 
Council of Europe, 2011) and who were paid for their participation. The 
participants were between 18 and 29 years old (mean: 21.74), 95 were 
female and 105 were right-handed. None of the participants reported any 
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hearing problems. The participants were randomly assigned to one of the 
three conditions (AB, AA, BB). 


2.2. Materials 


Our experimental words were selected from the vocabulary of two 
beginners’ textbooks used in French classes at Dutch secondary schools 
(Franconville and Grandes Lignes). They were bisyllabic words containing 
a high vowel (/i/, /y/, or /u/) following a voiceless consonant in their first 
syllable (cf. Appendix 1). Out of all possible words, we selected the 24 
most frequent words, with a preference for those containing /i/ and /y/ as 
these vowels are more constricted than /u/, which allows them to be more 
easily devoiced than /u/ (Meunier, Meynadier, & Espesser, 2008)!. The 
frequency of occurrence of our experimental words in the movie subtitles 
corpus of Lexique 3.81 (New, Pallier, Ferrand, & Matos, 2001) ranged 
from 0.71 (per million words) for cycliste ‘cyclist’? to 107.92 for sujet 
‘subject’ (mean: 31.40, cf. Appendix 1), that is, they were fairly frequent 
words (most of them ranging between the median at 8 occurrences per 
million words and the third quartile at 43 occurrences per million), which 
is normal for beginners’ vocabulary words. 

We also selected 78 bisyllabic frequent words, without particular 
restriction, from the aforementioned beginners’ textbooks to be used as 
existing-word fillers. Finally, we created 102 bisyllabic pseudo-word fillers 
by adding a phonotactically legal syllable to the first syllable of all the 
experimental and existing-word fillers already selected. 


1 One of the reviewers attracted our attention to the fact that the participants may 
not process the devoiced vowel at all despite the remaining durational and for- 
mant cues signaling the presence of the vowel (as it has been shown to happen 
for German natives listening to Japanese accented German; Zimmerer, Rei, & 
Reetz, 2013). In that case, three items could be confused with other French 
words (purée ‘mashed potatoes’ could be confused with pré ‘meadow’; pilote 
‘pilot’ with the reduced form of pelote ‘woolen ball’; and poulet ‘chicken’ with 
plaie ‘wound’). However, the occurrence frequencies of the possibly confounded 
words (pré ‘meadow’; pelote ‘woolen ball’; plaie ‘wound’) are all lower than the 
occurrence frequencies of our stimuli, making it unlikely that our participants 
knew these words. 
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All the stimuli, preceded by their definite determiners, were recorded in 
a sound attenuated booth with a head mounted microphone at 44.100 Hz 
by the first author of this paper, a female French native speaker from Caen. 
The easiest way to obtain fully devoiced high vowels in the first syllables 
appeared to have the speaker produce all the experimental words without 
their determiners. In this way, for the “devoiced” (that is: ‘containing a 
devoiced high vowel in the word’s first syllable’) recordings, the speaker 
could comfortably whisper the first syllables and then voice the second 
syllable, while for the “voiced” (that is: ‘containing a voiced high vowel 
in the world’s first syllable’) recordings, she could just speak out loud 
the whole words. The first vowel of the devoiced stimuli was always 
completely devoiced and the first vowel of the voiced stimuli was always 
fully voiced (cf. Figure 1). The speaker also recorded all the experimental 
words with their determiners. The best devoiced and voiced recordings 
without determiners were then each paired with their closest voiced 
recordings with determiner in terms of intonation and duration. The final 
stimuli were obtained by cross-splicing the voiced determiners with the 
devoiced and voiced recordings without determiners. 

We created two tokens for each voicing type, meaning that for each 
experimental word we obtained four tokens: a voiced token A, a voiced 
token B, a devoiced token A, and a devoiced token B. Tokens A were on 
average 805 ms long (804 ms for the voiced ones, SD = 106, and 806 ms 
for the devoiced ones, SD = 124) and tokens B were on average 811 ms 
long (796 ms for the voiced ones, SD = 134, and 826 ms for the devoiced 
ones, SD = 136). Note that for the B tokens, it is not the case that the 
devoiced form was always longer than the voiced one (cf. Appendix 1 for 
the durations of all individual tokens). The existing-word fillers and the 
pseudo-word fillers were not cross-spliced but two tokens were recorded 
per word-type. The average duration of the existing-word fillers was 719 
ms (SD = 120) and of the pseudo-word fillers 739 ms (SD = 128). 

Finally, all the stimuli were scaled to 70 dB of average intensity. All the 
stimulus recording, editing, and scaling was performed in Praat (Boersma 
& Weenink, 2017). 

The lexical decision task consisted of two blocks of 132 trials each. 
Twelve of the experimental words were presented in the first block and 
12 in the second block. Within each block, the experimental words were 
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Time (s) 


Frequency (Hz) 
Frequency (Hz) 


Time (s) 


Figure 1: Waveforms (top panels) and spectrograms (bottom panels) of the target 
word le silence ‘the silence’: voiced token A on the left, and devoiced token A on 
the right. The high-vowel /i/ boundaries are indicated by the vertical lines. 


repeated either as a variant match (i.e. both prime and target had either 
voiced or devoiced vowels) or as a variant mismatch (i.e. when the prime 
was voiced, the target was devoiced, and vice versa). The prime and target 
were separated by seven to 98 trials (average: 65), replicating the lags 
used in the first and third experiments of Hanique, Aalders, and Ernestus 
(2014). Although these lags are not as long as the ones used by Goldinger 
(1996), who found exemplar effects one week after presentation of the 
prime, they are long enough to ensure that our results could not stem from 
the participants holding the primes in their working memories until they 
could process the target. 

The remainder of the trials per block included 36 bisyllabic real- 
word fillers (of which six were repeated), and 48 bisyllabic pseudo-word 
fillers (of which 18 were repeated). Finally, six real-word fillers and six 
pseudo-word fillers were used for practice trials, with two real-word 
fillers and two pseudo-word fillers being repeated. The practice trials 
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Table 1: Average absolute temporal differences (in 
milliseconds) between primes (voiced and devoiced) and 
targets (voiced and devoiced) per condition. Standard 
deviations are given between parentheses. 


Condition Match Mismatch 
AB 44 (25) 53 (36) 
AA 0 49 (30) 
BB 0 50 (35) 


were the same for all the participants, and they were very similar in 
frequency of occurrence and phonological structure to the stimuli in the 
experiment. 

We created five pseudo-randomizations of the trials: a block never 
started with an experimental word; there were never two experimental 
words in a row; there were never more than eight pseudo-word fillers in 
a row; and a prime and a target were never separated by more than 100 
trials. For each pseudo-randomization, we then created four different 
stimulus lists that kept the trial order obtained by pseudo-randomization 
constant and differed only regarding the voicing type of the experimental 
words. In each of the four stimulus lists, the primes and targets of half of 
the experimental words occurred in the same pronunciation variant (six 
voiced ones, and six devoiced ones), and those of the other half showed a 
difference in voicing (six voiced primes followed by devoiced targets, and 
six vice versa). Consequently, across all four stimulus lists created from 
one pseudo-randomization, each experimental word was tested for each 
of the four possible matching and mismatching combinations. Each of the 
20 lists created in total were randomly assigned to two participants per 
condition. 

In Condition AB, we used different recordings (or tokens) for the primes 
and the targets, so that even in case of a match, the prime (token A) and 
the target (token B) were different recordings. As shown in Table 1, in the 
condition AB, the primes and targets matched in pronunciation variant 
but diverged in terms of duration. In condition AA and in condition BB, 
we only used the tokens A and B, respectively, so that in case of a match, 
prime and target were the same token and thus did not differ in duration 
(hence the zeros in Table 1). 
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2.3. Procedure 


The participants were tested individually in a sound attenuated booth 
equipped with headphones, a mouse, and a button box with stickers JA 
‘yes’ / NEE ‘no’ on the buttons. The participants first signed a consent form 
and filled in a language background questionnaire, before doing the lex- 
ical decision task. The lexical decision task was presented with PsychoPy 
(Peirce, 2007). The participants were instructed to indicate as fast as pos- 
sible with the button-box, using their dominant hand, whether the word 
they heard over the headphones was a real word in French or not. The 
instructions insisted that the participant did not need to know the exact 
meaning of the word in order to press the ‘yes’ button but that they had 
to be certain that the word occurred in French. The next trial initiated 
1000 ms after the participant’s answer or 3500 ms after the onset of the 
preceding stimulus in case the participant did not react. In order to increase 
motivation and discourage guessing, the participants received feedback in 
percentage accuracy at the end of each block. The whole experiment ses- 
sion lasted a little less than half an hour. 


3. Results 


One participant in condition AB and one participant in condition BB were 
removed from the dataset since their accuracy on the experimental words 
in the lexical decision task was below chance level (43.75 % and 33.33 %, 
respectively). 

We analysed all the data from this study using the software R (R 
Development Core Team, 2007). All the trials to which the participants 
did not react were discarded (ten out of the 5664 experimental word 
trials). Accuracies were analysed by means of a linear mixed effects model 
for logistic regression (Jaeger, 2008), for which the dependent variable 
was the probability of a correct response. Reaction times (RTs; measured 
from word offset) to correct trials within 2.5 standard deviations from the 
targets’ grand mean (345 ms; discarding 52 data points out of 1768; 3 % 
of the data) were analysed by means of mixed effects regression models 
(Baayen, Davidson, & Bates, 2008). Prior to analysis, all RTs and stimulus 
durations were log-transformed. Our dependent variable for the linear 
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mixed effect model was thus the log-transformed RT. We used item and 
participant as crossed random effect factors. 

Our predictors of interest were Voicing (a categorical predictor 
indicating whether the first high vowel of the stimulus was voiced 
or devoiced), Condition (AB, AA, and BB) or Token (A or B), and 
Repetition match (i.e. whether the prime and target of the experimental 
word were of the same pronunciation variant). Since Condition and 
Token overlap considerably in terms of the variation they explain, for 
each model reported, we compared two variants of our best model: one 
using Condition and one using Token in order to select the best of the 
two predictors. We retained in our final model the predictor which 
lowered the Akaike Information Coefficient (AIC) of the model by at 
least two points. 

Our control predictors were: log Stimulus duration, Trial number (i.e. 
the position of the trial in the experiment, in order to control for learning 
or fatigue effects), Distance (lag) between prime and target (in number 
of intervening trials), log RT to the previous trial (so as to control for 
local speed effects), and log RT on the prime. The continuous and discrete 
numerical predictors, that is, all the control predictors, have been centred 
around the mean. 

We first fitted a simple main effects model with all the predictors rel- 
evant to the dependent variable. Interactions were then tested between 
the predictors of interest only. To obtain the most parsimonious yet ade- 
quate model, only predictors and interactions which showed significant 
effects (i.e. t or z with an absolute value exceeding 1.96) were retained in 
the final models. Predictors which were significant in an interaction, but 
not as main effects were kept in the models as well. Once the fixed effect 
structure was finalized, random slopes on item and participant were 
tested for all fixed effects. A random slope was kept in the final model 
exclusively when supported by likelihood ratio tests (i.e. p<0.05). Finally, 
following Baayen (2008), to ensure no significant effect was driven by 
outliers, the final RT model was refitted: RTs with residual standard 
errors more than 2.5 standard deviation units were excluded from the 
dataset of the final statistical model (49 data points were removed out 
of 1716; 3 % of the dataset). No predictor lost significance as a result of 
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this refitting of the model’. The p values reported were obtained with the 
ImerTest package version 2.0-36 (Kuznetsova, Brockhoff, & Christensen, 
2017). 


3.1. Accuracy data 


The participants’ accuracy was relatively high although not at ceiling 
(83.92 % overall, with 85.52 % accuracy for the pseudo-word fillers, 
86.13 % for the real-word fillers, and 75.70 % for the experimental 
words). Participants’ lower accuracy on the experimental words was prob- 
ably due to the fact that the experimental words were less frequent than 
the real-word fillers and thus less familiar to the participants. 


First occurrences 


We first verified whether the participants were sensitive to the devoicing 
manipulation. To do so, we looked at the participants’ accuracy on the 
primes only (N=2824), since the participants’ accuracy on the targets 
might have been influenced by whether the targets matched or mismatched 
their primes. The results are presented in Table 2. The participants were 
significantly more accurate on the voiced (75.79 %) than on the devoiced 
(66.42 %) tokens A, as indicated by a simple effect of Voicing (cf. Table 2), 
while the difference was not statistically significant for tokens B (75.80 % 
accuracy on the voiced tokens and 73.49 % on the devoiced ones), as 
shown by releveling the variable and rerunning the model (B = 0.16, 
S.E.= 0.23, z= 0.67, p>0.1), and as indicated by the significant interac- 
tion between Voicing and Token (cf. Table 2). We also found a significant 
random slope of Voicing on Item, which indicates that the effect of Voicing 
was significantly larger for some items than others. 


2 One of the reviewers suggested that we use the Median Absolute Deviation 
(MAD; Leys, Ley, Klein, Bernard, & Licata, 2013) to prune our data instead of 
first discarding outliers 2.5 Standard Deviations from the targets’ mean RT and 
then discarding again outliers deviating more than 2.5 standard units from the 
predicted values before re-fitting the model. An analysis of our RT data using 
the MAD is provided in Appendix 2. Importantly, both analyses find the same 
predictors significant. Thus, both analyses come to the same conclusions. 
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Table 2: Statistical model fitting the probability of a correct response to the primes. 
N = 2824. Standard error is indicated by SE. The intercepts represent devoiced 
A tokens’ first occurrences. Predictors and random slopes that did not reach signif- 
icance at the 5 % level were not retained in the model and are not listed in the table. 


Fixed effects B SE Z p< 
(intercept) 0.96 0.29 3.36 0.001 
Token B 0.46 0.19 2.41 0.05 
Voicing voiced 0.63 0.19 3.26 0.01 
Voicing * token voiced * B -0.48 0.21 -2.29 0.05 
Random effects Variance SD 

Item Intercept 1.65 1.28 

voicing 0.49 0.70 
Participant Intercept 0.40 0.63 


In sum, the participants were thus clearly sensitive to the devoicing 
manipulation for the tokens A, but not for the tokens B. That is, to the 
participants, the devoiced and voiced tokens A were more distinguishable 
from one another than the devoiced and voiced tokens B, although this 
was more the case for some experimental words than for others. 


Second occurrences 


Given that the participants were sensitive to the devoicing manipulation 
(at least for the tokens A), we can now investigate whether the participants 
were more accurate on matching than on mismatching targets. When 
only considering the targets whose primes were answered to correctly 
(N=2041), there appeared to be no effect of Repetition match on accu- 
racy, neither as a main effect nor in interaction with Condition or Token. 


3.2. Reaction Time data 


The RT data suggest priming across all conditions (cf. Figure 2): when 
the participants correctly classified both the prime and the target of the 
experimental word as real words, they were on average 106 ms faster on 
the target (345 ms) than on the prime (451 ms). Note that all RTs are from 
word offset. 

We analysed statistically the RTs to the targets answered to correctly, 
provided their primes had also been answered to correctly. The results 
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+h F a prime 
a match 


a mismatch 


200 250 300 350 400 450 500 


AB BB AA 


Figure 2: Reaction times (in milliseconds) from word offset for the experimental 
primes and targets (in match and mismatch cases) when both have been answered 
to correctly, by condition. Error bars: 95 % confidence intervals. N = 3454. 


are presented in Table 3. Almost all our control predictors showed signif- 
icant effects. The participants were faster at answering targets when they 
also answered quickly on the previous trial; when they had recognized the 
prime quickly; when the number of intervening trials between prime and 
target was low; and when the stimuli were short. 

More importantly, all of our factors of interest also showed significant 
effects. The effect of Repetition match differed between the conditions AB 
vs. AA (B = 0.17, S.E. = 0.05, z = 3.27, p<0.01) and BB vs. AA (p = 0.15, 
S.E. = 0.05, z = 3.00, p<0.01), as shown by releveling the variable and 
rerunning the model. Given that the conditions AB and BB thus patterned 
together against the condition AA (cf. Figure 2), it is not surprising that 
Token of the target (A or B) was a much better predictor than Condition 
(the model with Token had an AIC ten points lower than the AIC of the 
model using Condition). 

We also found a main effect of Voicing (see Table 3), without an interac- 
tion of Voicing with Token: participants were slower at processing devoiced 
targets, independently of whether the targets were token A or token B (cf. 
Figure 3). That is, contrary to the Accuracy data, which showed that the 
participants were only sensitive to devoicing for the tokens A, the RT data 
show that the participants were sensitive to the devoicing manipulation 
for both tokens A and tokens B. L2 listeners were thus sensitive to L1 
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m match 


a mismatch 


Voiced A Devoiced A VoicedB Devoiced B 
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Figure 3: Reaction times (in milliseconds) from word offset for the experimental 
targets which have been answered to correctly both at prime and target, grouped 
by voicing and by token. Error bars: 95 % confidence intervals per bar. N=1727. 


irrelevant information. Interestingly, the significant main effect of Token 
without an interaction of Voicing and Token indicates that the tokens B 
were processed significantly faster (i.e. were easier to comprehend for the 
participants) than the tokens A, independently of whether the tokens were 
voiced or not. 

Repetition match was significant in interaction with Token on the one 
hand (as previously mentioned) and with Voicing on the other hand. The 
three-way interaction was not significant (x?(2) = 0.79, p>0.1). The signif- 
icant simple effect of Repetition match indicates that when the target was 
the devoiced token A, the participants were faster at answering the target 
when it matched its prime than when it mismatched it prime. 

The significance of Repetition match in the other three cases (i.e. when 
a devoiced B target matched its prime, when a voiced B target matched its 
prime, and when a voiced A target matched its prime), is difficult to assess 
from Table 3 given the separate significant simple effects of Voicing and 
Token on the one hand, and their significant interactions with Repetition 
match on the other hand. In order to understand the overall effect of 
Repetition match, we analysed the different contrasts using releveling. 
By releveling, the model does not change, but the mathematical formu- 
lation makes it possible to determine the simple effects in the other three 
cases. We placed alternatively on the intercept of the model reported 
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in Table 3, the voiced tokens A, the voiced tokens B, and the devoiced 
tokens B. Only for the voiced tokens B did we find a significant effect of 
Repetition match (8 = 0.17, S.E. = 0.03, z = 5.02, p<0.001), indicating 
that the participants were significantly slower when the voiced targets B 
matched their primes (either voiced primes A or voiced primes B). In other 
words, the participants were significantly faster when the voiced targets 
B mismatched their primes than when the voiced targets B matched their 
primes. For both the voiced tokens A and the devoiced tokens B, the main 
effect of Repetition match was not significant. 

In sum, the participants were sensitive to the devoicing manipulation as 
they were less accurate on the devoiced than on the voiced primes A, and 
they were slower on both the devoiced A and B targets than on the voiced 
A and B targets. Repetition match showed no effect in the Accuracy data, 
possibly because of lack of statistical power. In the RT data, the A and B 
tokens patterned differently regarding the effect of repetition match: the 
devoiced A tokens were answered to faster when they were preceded by a 
matching prime, while the voiced B tokens were answered to significantly 
faster when they were preceded by a mismatching prime. In other words, 
devoiced primes always shortened the participants’ RTs on the targets, 
while voiced primes never led to any significant differences in RTs between 
a matching and a mismatching target. 


4. General discussion 


This study investigated whether L2 learners show exemplar effects for var- 
iation in the acoustic signal that they are not familiar with from their 
L1. If exemplars are formed after the L1 phonological filter applies, L2 
exemplars do not differ from L1 exemplars regarding indexical variation, 
but only regarding L1 irrelevant linguistic variation. 

We tested Dutch intermediate learners of French in a lexical decision 
task in which words were repeated (i.e. using long-term implicit repeti- 
tion priming) in the same (match) or in a different (mismatch) pronun- 
ciation variant. Our experimental words were French words whose first 
vowel was voiced in one pronunciation variant (voiced word tokens) and 
devoiced in the other (voiceless word tokens). Vowel devoicing is not a 
characteristic of Dutch and thus linguistically irrelevant for Dutch native 
listeners. 
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In order to investigate whether the match effect is not only present 
under the conditions normally tested in exemplar experiments, but also 
under more ecologically valid conditions, we tested three conditions. In 
two conditions (AA and BB), the prime and target were identical tokens 
in the pronunciation match case. These two conditions follow the vast 
majority of the previous literature on exemplar effects (which reuses the 
same token). In a third condition (AB), the primes and targets were always 
different instantiations, so that, even when an experimental word was 
repeated as a pronunciation variant match, it was nevertheless a different 
token, just like in everyday conversations. 

Our data suggest an exemplar effect for the devoiced A targets, since the 
devoiced A tokens were answered to faster when they were preceded by 
devoiced A primes than by voiced A primes. This match effect shows that 
L2 listeners are able to encode and store in the form of exemplars pho- 
netic variation that does not occur regularly in their L1 (vowel devoicing). 
Exemplars thus seem to be formed before the phonological filter applies 
and to faithfully represent the acoustic signal. The information they encode 
is probably the same for both native and non-native listeners. 

If exemplars are formed before the phonological filter applies, one may 
wonder whether exemplars are part of the mental lexicon. This question has 
also been raised by Goldinger (2007), Cutler, Eisner, McQueen, and Norris 
(2010), Ramus, Peperkamp, Christophe, Jacquemot, Kouider, and Dupoux 
(2010), and Nijveld, ten Bosch, and Ernestus (2015), among several authors, 
who hypothesise that exemplars are stored in episodic memory, which is a 
general type of memory (Tulving, 1985). Episodic traces are detailed memory 
representations which are context-dependent in the sense that they encode 
specific events (e.g. listening to a word, watching a movie, hurting one’s toe) 
with their context (e.g. which voice uttered the word, in which row one was 
seated, how early it was). If exemplars are faithful representations of the 
acoustic signal, they are likely to be part of episodic memory. 

The significant interaction we found between Repetition match and 
Token indicates that our participants used different processes to compre- 
hend the B and the A tokens. Although conditions AA and BB both used 
identical tokens for matching primes and targets, they did not pattern in 
the same way in the participants’ RT behaviour on the targets. Rather, 
the BB condition patterned with the AB condition. In both conditions, 
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there were no exemplar effects. It is thus not the fact that the prime and 
the target were identical that led to exemplar effects. These results are in 
contrast with all previous studies on exemplar effects, including Hanique 
et al.’s (2014), which showed that exemplar effects can also arise when the 
prime and target are different tokens in the match condition. 

Various explanations have been put forward to explain why exemplar 
effects arise in certain conditions and not in others. One hypothesis is that 
exemplar effects occur when speech processing is slow, such as when lis- 
tening to dysarthric speech (Mattys & Liss, 2008), or when real words need 
to be distinguished from very real-word-like pseudowords (McLennan & 
Luce, 2005). This time-course hypothesis (McLennan & Luce, 2005) can 
explain the presence versus absence of exemplar effects as the participants 
were slower on the targets A than on the targets B. 

The time-course hypothesis, however, cannot account for mismatch 
effects. Our data showed one mismatch effect. Participants responded 
more slowly to voiced B tokens when they were preceded by voiced than 
devoiced tokens. This raises the question of where this effect comes from. 
This is an important question since it may provide some insight into the 
conditions leading to exemplar effects, and therefore to the nature of 
exemplar effects. The difference in results between conditions AA (match 
effect for devoiced tokens) and BB (mismatch effect for voiced tokens) is 
the most interesting one, since both conditions used identical tokens for 
prime and target and it is therefore not obvious what drives the difference 
in response pattern. 

It may be the case that the difference in response pattern is due to 
subtle acoustic differences between the set of A tokens and the set of B 
tokens. The voiced and devoiced tokens were probably more different 
from each other in condition AA than in condition BB. The selection of 
the tokens for the primes and target for condition AA was made before 
the selection of the tokens for condition BB and from the same pool of 
recordings. Consequently, for the cross-splicing of tokens B, the first 
author had fewer recordings to choose from than for the cross-splicing 
of tokens A, which probably caused voiced and devoiced tokens A to be 
better matched than voiced and devoiced tokens B on other acoustic char- 
acteristics than devoicing. This was definitely true for stimulus duration 
(cf. Appendix 1): the voiced and devoiced tokens A only differed by 2ms 
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on average, while the voiced and devoiced tokens B differed by 28.5 ms 
on average’. 

To further investigate potential differences between the voiced and 
devoiced tokens which might have caused our asymmetric results in the 
AA and BB conditions, we conducted a post-hoc spectral comparison of all 
voiced and devoiced tokens, using the differences along the Mel Frequency 
Cepstral Coefficients alignment path, time warped. The results are sum- 
marized in Appendix 3. We found that the voiced and devoiced A tokens 
differed more from each other than the voiced and devoiced B tokens. 
However, this difference was not significant (t(45) = -0.27, p>0.1) probably 
because of lack of statistical power. Consequently, it is possible that the 
difference between voiced and devoiced vowels stood out less clearly for 
the B tokens than for the A tokens, especially given the accuracy differences 
found between the voiced and the devoiced primes: the participants were 
about 9 %, and significantly more accurate on the voiced than on the 
devoiced primes A, but only 2 % more accurate on the voiced than on the 
devoiced primes B, and this latter difference was not statistically significant. 

The participants’ significantly lower accuracies on the devoiced 
A primes compared to all other primes, in combination with their signif- 
icantly lower RTs on both the A and B devoiced targets compared to the 
voiced targets could explain our pattern of results. On the one hand, the 
difficulty of processing of both the A and B devoiced tokens could have 
led the participants’ abstract representation to reach a higher level of acti- 
vation (as activation only increases over time, e.g. Norris & McQueen, 
2008) than after the processing of a voiced prime (for which activation 
stopped to increase as the word was recognized earlier in time). When a 
voiced target then followed a devoiced prime, the ease of processing of 
the voiced forms combined with the high activation of the abstract repre- 
sentation, led to a quicker answer on a mismatching than on a matching 
target. On the other hand, the fact that the devoiced A primes were par- 
ticularly difficult to comprehend could have led to stronger individual 


3 This difference in stimulus duration probably stems from a difference in the 
duration of the high vowel (cf. Figure 1). Importantly, 28.5 ms are above 
the threshold of just noticeable differences for vowel duration (Quené, 2007; 
Nooteboom & Doodeman, 1980). 
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memory traces (or exemplars) being encoded for the devoiced A than for 
the devoiced B primes. In turn, these highly activated exemplars would 
then be easy to retrieve and to match to the particularly distinguishable 
devoiced A targets. When both the prime and target were devoiced tokens, 
the participants could thus more easily use the exemplar formed with the 
prime in condition AA than in condition BB. This would explain why there 
was only a match (exemplar) effect in condition AA with devoiced targets. 

This explanation of our asymmetric results would be in line with other 
studies which propose that listeners may display exemplar effects only 
under testing conditions that encourage participants to rely on their recent 
(or episodic) memory. Luce and Lyons (1998) found exemplar effects in 
an old/new categorisation task, which explicitly requires the participants 
to make use of their recent memory, but not in a lexical decision task. 
Hanique et al. (2014) only found exemplar effects in a lexical decision 
task when it was crystal clear to the participants that tokens were repeated 
(when the percentage of repeated tokens was high and the number of inter- 
vening trials between the prime and the target remained low). Moreover, 
they only found exemplar effects when manipulating only linguistic and 
not both linguistic and indexical variation within one experiment. Thus, 
if the stimuli included too much variation, like the tokens B in our experi- 
ment did, no exemplar stood out from the other episodic traces, and con- 
sequently no exemplar could be reused in the matching conditions. 

Other types of variation have been shown to influence the presence of 
exemplar effects. For example, confusability between vowels categories has 
been shown to hinder the benefits of High-Variability training on vowels’ 
identification (Wade, Jongman, & Sereno, 2007), while High-Variability 
training benefits are traditionally explained with more exemplars creating 
a more robust category as a cloud than individual exemplars. It thus seems 
that to produce effects, exemplars need to be clearly recognized or labelled 
by the listener as belonging to two separate clouds or categories. 

So far, we have explained our results within models assuming hybrid 
lexicons. Some other recent models of speech perception answer the problem 
of the lack of invariance of the speech signal by focusing on how listeners 
integrate incoming information from the input with their own predictions 
over the same speech signal, depending on the situation. For example, in 
their ‘ideal adapter’ framework, Kleinschmidt and Jaeger (2015) propose 
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that listeners constantly learn from details in the speech signal to imme- 
diately adapt their expectations about the incoming input. Whereas this 
framework accounts well for adaptations to differences among individual 
speakers stemming from regular and suprasegmental variation within the 
speech input, it is less clear which predictions it would make with regard 
to adaptation to irregular phonetic variation. In our study, participants 
probably noticed that words were repeated, however, they could certainly 
not predict whether the target would match or mismatch its prime. In the 
absence of certainty, we may expect listeners not to adapt, and thus to rely 
on their abstract representations, representing the full forms. Consequently, 
voiced B targets should benefit from a matching voiced prime (meeting the 
listeners’ long-term expectations of the listeners). However, this is not what 
we found. In the AB and BB conditions, a mismatching prime speeded the 
recognition of its voiced target. Our design, however, is not best suited to 
test the predictions of the ‘ideal adapter’ model. More studies manipulating 
irregular phonetic variations with more predictable stimuli are needed to 
test predictive models of speech perception. 

Finally, our results strongly support Hanique et al. (2014)’s claim that 
exemplars probably play a very limited role in everyday speech compre- 
hension given that in our study, not only exemplar effects arose in very 
limited conditions, but we also found significant mismatch effects (i.e. the 
use of abstract representations), even in the very conditions which were 
expected to trigger exemplar effects. It is currently assumed that exemplars 
are used for speech comprehension. However, given Hanique et al.’s result, 
our results, and the many null results reported in the exemplar literature 
(e.g. Luce & Lyons, 1998; McLennan et al., 2003; Mattys & Liss, 2008; 
Hanique et al., 2014, Nijveld et al. 2015), it is quite clear that exemplar 
effects are not so robust. Researching the exact conditions which can con- 
sistently trigger exemplar effects is essential in order to find which role 
exemplars actually play in everyday speech perception. 


5. Conclusion 


Exemplar effects can also be found for L2 learners, even when the prime 
and target encode phonetic information that does not occur regularly in 
the learners’ L1. This shows that exemplars can encode information that 
the phonological filter usually discards, and exemplars must therefore be 
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formed before the phonological filter applies. Exemplars are thus probably 
not part of the mental lexicon. Interestingly, we also found that participants 
displayed different response patterns when presented with different tokens 
of the same words in exactly the same testing conditions. This finding 
particularly questions the robustness of exemplar effects. Hanique et al. 
(2014) already warned that exemplars are probably not used in everyday 
speech comprehension given the limited conditions under which exem- 
plar effects arise. Our study supports this conclusion and extends it to L2 
listeners for whom the conditions under which exemplar effects arise ap- 
pear even more limited. 
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Appendix 1: Experimental word-types (with their translations) classified by their 
high vowel, with their token durations (in ms) and frequencies of occurrence (per 
million words) as reported for movie subtitles in the database Lexique3. Standard 


Deviations from the mean are reported between parentheses. 


High-vowel Word-types 
hil le chinois 
the Chinese language 
la cité 
the city 
le citron 
the lemon 
le cycliste 
the cyclist 
le kilo 
the kilo 
le pilote 
the pilot 
la piscine 
the swimming pool 
le silence 
the silence 
le ticket 
the ticket 
/y/ la cuisine 
the kitchen 
la culture 
the culture 
la fumée 
the smoke 
le futur 
the future 
la purée 
the mashed potatoes 
le succès 
the success 
le sujet 
the subject 
le surnom 
the nickname 
la tulipe 
the tulip 


A B 
voiced devoiced voiced devoiced 
660 730 601 
753-650 728 
681 655 613 
943 923 919 
921 871 907 
944 897 979 
933 1028 910 
898 966 925 
903 865 963 
691 655 632 
853 884 925 
660 710 668 
883 958 846 
933 1028 910 
821 763 811 
700 787 762 
740 752 654 
765 763 762 


694 


682 


599 


884 


955 


895 


972. 


1019 


959 


665 


921 


693 


903 


972. 


862 


741 


794 


871 


Frequency 
Freqfilm2 
21.88 


14.55 


Appendix 1: 


tul 


Average 
(SD) 
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Continued 


la couleur 
the colour 
le couloir 
the corridor 
le courage 
the courage 
la poubelle 
the garbage (can) 
le poulet 
the chicken 
la poupée 


the doll 


1071 
740 
709 
632 
664 
872 


824 
(137) 
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Frequency 
105.53 


39.58 
107.92 
6.20 
13.62 
1.53 


31.40 
(31) 
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Appendix 2: Statistical model fitting the log-transformed response times (measured 
from word offset) to the targets whose corresponding primes have been answered 
to correctly. N = 1647 after removal of the outliers that are 2.5 absolute deviations 
lower or higher than the median. Standard error is indicated by SE. The intercept 
represents the reaction time to a devoiced target A mismatching its prime. 


Fixedeffets b ā E çt 
(intercept) 5.74 0.05 113.00 0.001 
Repetition match Match -0.12 0.04 -2.82 0.01 
Token B -0.05 0.05 -1.10 n.s. 
Voicing Voiced -0.18 0.04 -4.35 0.001 
Number of trials between 0.002 0.0006 2.69 0.01 
prime and target 
Stimulus duration (ms -1.13 0.16 -7.09 0.001 
logged) 
RT to the preceding trial (ms 0.17 0.03 6.23 0.001 
logged) 
RT to the prime (ms logged) 0.30 0.02 15.89 0.001 
Repetition match * voicing match * voiced 0.12 0.04 2.90 0.01 
Repetition match * token match * B 0.16 0.04 3.68 0.001 
Random effects Variance SD 
Item Intercept 0.02 0.15 

Voicing 0.02 0.14 

RT to the preceding 0.007 0.08 

trial 
Participant Intercept 0.04 0.19 

Stimulus duration 0.13 0.37 

RT to the preceding 0.01 0.11 

trial 


Residual 0.17 0.42 
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Appendix 3: Average spectral differences along the Mel Frequency Cepstral 
Coefficient alignment path between primes (voiced and devoiced) and targets (voiced 
and devoiced) time warped per condition. Standard Deviations are reported between 
parentheses. 


Condition Prime Target Condition Spectral 
differences 
AB Voiced A Voiced B Match 755 (109) 
Match 813 (120) 
Devoiced A Devoiced B 
Voiced A Devoiced B Mismatch 1091 (207) 
Mismatch 1092 (256) 
Devoiced A Voiced B 
AA Voiced A Voiced A Match 0 
Devoiced A Devoiced A Match 0 
Voiced A Devoiced A Mismatch 
Devoiced A Voiced A Mismatch 1044 (197) 
BB Voiced B Voiced B Match 0 
Devoiced B Devoiced B Match 0 
Voiced B Devoiced B Mismatch 
Devoiced B Voiced B Mismatch 1028 (218) 
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